Delivering predictive models to customer-facing outlets is hard. On the one hand, the system should avoid stifling data scientists’ creative thinking; on the other, it must maintain SLAs on performance and reliability. This article discusses two approaches to implementing such a system.
The previous article outlined what a delivery system is and what it should do. This article dives into strategies for implementing such a system.
Combining Machine Learning models with human-curated knowledge
Data scientists train and score a vast variety of models, with the ultimate goal of impacting the business in some meaningful way. We tend to think of models exclusively as complicated Machine Learning or statistical algorithms. This category of models identifies patterns in observed data, explains the world and makes predictions without explicit pre-programmed rules. Oftentimes, these ML models are carefully crafted by selecting the features and hyper-parameters that best match the domain, as understood through collaborative efforts between data scientists and domain experts.
This article also considers another type of model: human knowledge. Many customer-facing outlets, such as news websites, VOD platforms or e-commerce platforms, present end-users with a list of what domain experts (often product owners or editors) deem relevant, in the form of ‘editors’ picks’ or ‘most popular’ sections and the like. Factors in these decisions include, among others, implicit knowledge accumulated over years of experience about which items will ‘work well’, marketing goals, strategic long-term considerations, and so forth. Encoding all these factors in Machine Learning models, while perhaps theoretically possible, would be unfeasibly costly for most organisations.
Integrating, or blending, both of these model types with the product and delivering predictions to end-users as personalised recommendations, offers, ad placements or UI changes has the potential to increase customer engagement significantly. In fact, at Primed we see in practice that the combination of the two yields better results than the sum of its parts. To achieve that, models need to be ‘delivered’ to the frontends. Unfortunately, building and maintaining ML systems is not easy. The following sections describe two possible approaches to this delivery problem and outline their benefits and drawbacks.
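To make the idea of blending concrete, here is a minimal sketch of one way it could work: a weighted combination of an ML relevance score with human-curated signals. The field names and weights are hypothetical illustrations, not Primed’s actual blending logic.

```python
# Hypothetical sketch: blending an ML relevance score with editorial curation.
# The weights and field names below are illustrative assumptions only.

def blend_score(ml_score: float, editor_pick: bool, popularity: float,
                w_ml: float = 0.7, w_editorial: float = 0.2,
                w_popularity: float = 0.1) -> float:
    """Weighted combination of a model prediction and human-curated signals."""
    editorial = 1.0 if editor_pick else 0.0
    return w_ml * ml_score + w_editorial * editorial + w_popularity * popularity

# Rank candidate items by the blended score.
candidates = [
    {"id": "article-1", "ml_score": 0.82, "editor_pick": False, "popularity": 0.4},
    {"id": "article-2", "ml_score": 0.65, "editor_pick": True,  "popularity": 0.9},
]
ranked = sorted(
    candidates,
    key=lambda c: blend_score(c["ml_score"], c["editor_pick"], c["popularity"]),
    reverse=True,
)
```

In practice the weights themselves become something to tune and A/B test, which is part of what makes delivery more than a pure engineering exercise.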
Serialising models
A popular way of deploying models to production is for data scientists to serialise the model in some way and make the serialised representation available to product teams. Production environments can then deserialise the model and expose it via (REST) APIs, which frontends call to retrieve predictions as needed. Python-based approaches often use Pickle and Flask, while R-based approaches use Plumber or DeployR. Alternatively, Dockerised approaches can accommodate both R and Python models, as well as more exotic tooling.
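A minimal sketch of the Pickle-plus-Flask pattern is shown below. The model file name and the expected feature layout are placeholders for illustration, assuming a scikit-learn-style estimator was serialised by the data scientist.

```python
# Minimal sketch: deserialise a pickled model and expose it via a Flask REST API.
# "model.pkl" and the feature layout are assumptions for illustration.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # trained estimator serialised by the data scientist

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Every request runs model inference inline, which is exactly why the latency and reliability concerns discussed below land on the shoulders of whoever built the model.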
This approach has the benefit of a very low barrier to entry: plenty of tutorials and blog posts are available. Meanwhile, data scientists can continue to work in environments and tools they are familiar with, whilst product teams are equally used to working with (REST) APIs.
There are a few drawbacks:
SLAs on performance and reliability are (partially) placed with the data scientists. These SLAs are crucial: Amazon reportedly found that every 100ms of latency cost it 1% in sales. The experimental, R&D character of data science is fundamentally incompatible with strict guarantees on latency, robustness and throughput. Effectively, every time a new model is deployed, this approach risks adversely impacting the end-user experience.
Serialisation is tightly coupled to the technology used, reducing portability and flexibility towards future frameworks and tools. This in turn makes it hard for data science departments to keep researching and developing over time.
Only a subset of models can readily be delivered via this method. In particular, pure functions (models that operate solely on their inputs) can be represented cleanly using this approach. Models that compare inputs with other items, e.g. content-based filters, are harder to represent cleanly.
Careless API design quickly creates dependencies on frontend development. Adding a new model type, or even changing an existing model, can trickle down to frontend changes. This in turn can slow down data science development considerably, while wearing down the frontend teams’ trust in data science solutions.
Pre-calculating models
Alternatively, predictions can be pre-calculated and subsequently fetched from a fast look-up store. Approaches often use Redis or another in-memory store to perform these very fast lookups. Here too, a (REST) API is needed to enable easy access, using a syntax familiar to frontends.
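A minimal sketch of the lookup side is shown below, assuming predictions have already been pre-calculated by a batch job and written to Redis under per-user keys. The key schema and JSON payload are illustrative assumptions.

```python
# Minimal sketch: serve pre-calculated predictions from Redis via a thin REST API.
# The key schema ("predictions:<user_id>") and payload shape are assumptions.
import json

import redis
from flask import Flask, abort, jsonify

app = Flask(__name__)
store = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_predictions(user_id: str, ranked_items: list) -> None:
    """Batch-job side: push pre-calculated recommendations into the lookup store."""
    store.set(f"predictions:{user_id}", json.dumps(ranked_items), ex=24 * 3600)

@app.route("/recommendations/<user_id>")
def recommendations(user_id: str):
    payload = store.get(f"predictions:{user_id}")
    if payload is None:
        abort(404)                      # in practice, fall back to a default list
    return jsonify(json.loads(payload))  # constant-time lookup, no model inference

if __name__ == "__main__":
    write_predictions("user-42", [{"id": "article-2", "score": 0.91}])
    app.run(port=5000)
```

Because no model code runs at request time, the serving path is just a key lookup, which is what makes the latency guarantees discussed next feasible.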
What is unique about this approach is that it allows latency and throughput to be guaranteed, making it very suitable for customer-facing applications. It does so without restricting the data scientist, who can continue working with their tools of preference (R, Python, Spark, scikit-learn, etc.) and build any type of model, free of the burden of engineering concerns such as latency, scalability, throughput, security and uptime.
On the downside, pre-calculating these predictions and synchronising them to the store is expensive. ‘Ensemble’ models, or blends, in particular quickly balloon into huge matrices that become unmanageable if not properly engineered. Building this type of system from scratch so that it supports blending of ML and human expertise as well as systematic A/B testing is hard. Because it operates in the customer-facing domain, aspects such as security, scalability, uptime, monitoring and client-side integration further complicate the engineering task.
Which one works best?
Clearly, there is no ‘best’ solution and, as with many things, it depends on what you want to achieve. In general, use cases with a small number of models that do not change often benefit from the serialisation approach. It is quick to set up, and many resources and products already exist.
However, if the models must operate in a customer-facing environment, this may not be the right approach: eventually the lack of strict guarantees on latency and throughput will degrade the user experience or even cause downtime for end-users. In this latter case, where strict SLAs are required, pre-calculating predictions and storing them for fast lookup could be the better approach.
Do you think this was interesting, but you’re also convinced you can do better? That’s great, because then we want to talk to you! We’re hiring :) Have a look at our website @ https://primed.io/careers/ or send us an email on info@primed.io