You shipped your machine learning model, and it’s starting to interact with real users. Congratulations on not being part of a (possibly made up) statistic about 87% of models never making it into production.

Unfortunately, no matter how good your model was offline, it may not remain that way for long in the real world. It’s time to consider monitoring the model and refining it in production.

Worse still, for all of the best practices that have emerged for building models, there are no equivalent best practices for managing them once they are deployed. To make matters worse, the internet has loads of bad advice on this topic.

Model monitoring is a necessary (but not sufficient) part of operating production models. In this post, we’ll explore what actually works for model monitoring based on our experience working with dozens of companies, teaching thousands of students, and immersing ourselves in the research literature. We’ll also debunk some of the bad advice you might have heard.

What works

Forget drift; measure what matters

If you’re operating an ML-powered product, it’s critical to measure how well it’s working. And by far the most important signal to look at is outcomes or feedback on your model’s predictions. From actual users. All other metrics are just proxies. For a recommender system, that means measuring whether users actually engage with what you recommend, not just offline ranking metrics.

This isn’t specific to recommender systems. In fact, if I were president of the ML blogosphere for a day, my first act would be to ban all “model monitoring” advice that doesn’t start with this.

Designing your ML-powered products to include feedback loops with end users may seem like overkill, but if you look closely at the best products, you’ll start to see it everywhere: from Tesla tracking how often you intervene with Autopilot to Google Photos asking you to verify whether two people are the same.

What if it’s impractical to gather feedback from your users? Redesign your product. I’m serious.

A company I spoke with recently was building a tool to recommend interventions teachers could take based on different aspects of student performance. When they realized that there was no way to connect the interventions back to the model's suggestions, the company decided to redesign the entire UI to change the way the recommendations were surfaced to the teachers: all in service of making it easier to create a feedback loop.

Can’t get feedback from users? Get a proxy instead

I’ll admit that changing your product may not be the most pragmatic suggestion. If you can’t gather feedback from users, the next best thing to do is label some production data to measure model performance metrics.

Even if you don’t have the time or budget to set up a full-fledged labeling effort, you can get a lot of value by manually labeling a small number of data points every day. Throw a “labeling party” every couple of days where your team labels a few data points together. Shreya Shankar recommends setting this up as an on-call rotation. As she points out, something is better than nothing here.

If labels are hard to come by, consider designing one or more problem-specific proxy metrics. For example, some of our customers working in natural language generation measure properties of the generated text like toxicity and repetitiveness that are associated with a poor user experience. Similarly, in this excellent talk, Lina Weichbrodt suggests using the share of personalized responses and share of empty / fallback responses as proxies for personalization quality.

If you’re looking for ideas to develop proxy metrics, try doing some error analysis on the labeled data you do have to find problematic inputs or outputs. Then use them to design proxies that would have detected those points. Functions that noisily detect edge cases make good proxy metrics, and as a side effect can help you find more of those edge cases to use the next time you train. In a pinch, you can even use your model’s confidence, though that approach has some theoretical issues.
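
To make this concrete, here is a minimal sketch of what a couple of proxy metrics might look like for a text-generation use case. The fallback responses, thresholds, and function names are illustrative assumptions, not a prescription:

```python
# Sketch of problem-specific proxy metrics for a text-generation model.
# The fallback set and thresholds below are hypothetical; tune them for
# your own product.

FALLBACK_RESPONSES = {"", "Sorry, I can't help with that."}  # hypothetical canned replies

def repetitiveness(text: str) -> float:
    """Share of repeated tokens; higher values suggest degenerate output."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return 1.0 - len(set(tokens)) / len(tokens)

def fallback_rate(responses: list[str]) -> float:
    """Share of empty or canned fallback responses in a batch."""
    if not responses:
        return 0.0
    return sum(r.strip() in FALLBACK_RESPONSES for r in responses) / len(responses)

def low_confidence_rate(confidences: list[float], threshold: float = 0.5) -> float:
    """Share of predictions below a confidence threshold (use with caution)."""
    if not confidences:
        return 0.0
    return sum(c < threshold for c in confidences) / len(confidences)

batch = ["the cat sat on the mat the cat", "", "Sorry, I can't help with that."]
print(fallback_rate(batch))      # 2 of 3 responses are empty or canned
print(repetitiveness(batch[0]))  # 0.375 – some of the tokens are repeats
```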

Drill down to find negative feedback loops

You’ve checked your model’s accuracy and your user feedback, and the model appears to be working. That’s great! Unfortunately, critical performance problems may still lurk beneath the surface.

If you’re familiar with monitoring web applications, you know that we care about metrics like the 99th percentile for latency not because we worry about what happens to a user once per hundred queries, but rather because for some users that might be the latency they always experience. For machine learning applications, these poor subgroup experiences are even worse because they can lead to negative feedback loops.

For example, say you’re building a bot detection system for a social networking site. You decide to monitor the false positive rate – the fraction of flagged accounts that are not actually bots. One day you ship a new natural language understanding model that shows fewer false positives. But there’s a bug – your model performs worse in German than it did before. Now a higher percentage of your false positives are from users in Germany. The German users get frustrated that their posts are being flagged and churn at a higher rate. To make matters worse, this causes your metric to improve, since you now have a lower proportion of users speaking the worse-performing language on the platform. You’ve created a negative feedback loop.

A simple but effective way to avoid unintended consequences of deploying a model that performs better in aggregate but worse on a subgroup is to monitor performance on all subgroups that are important to your business. Common ways to subgroup model performance include account age, location, or different use cases of your product.
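
As a rough sketch of what this looks like in practice, here’s one way to slice a single metric by subgroup with pandas, using the bot-detection example above (the data and column names are made up):

```python
import pandas as pd

# Hypothetical per-prediction log for the bot-detection example:
# ground-truth label, model decision, and the user's country.
df = pd.DataFrame({
    "country": ["DE", "DE", "DE", "US", "US", "US", "US", "US"],
    "is_bot":  [0,    0,    1,    0,    0,    1,    1,    0],
    "flagged": [1,    0,    1,    0,    0,    1,    1,    0],
})

non_bots = df[df["is_bot"] == 0]

# The aggregate false positive rate looks acceptable...
print("overall FPR:", non_bots["flagged"].mean())      # 0.2

# ...but slicing by country shows German users fare much worse.
print(non_bots.groupby("country")["flagged"].mean())   # DE: 0.5, US: 0.0
```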

A particularly important use case for measuring subgroup performance is to detect and mitigate potential biased or unfair model behavior. Models tend to reflect (or sometimes amplify) the bias present in the data they are trained on. Detecting and mitigating bias is an active research area that we can’t do justice to in this post. However, slicing your performance metrics along sensitive categories can be an effective first step toward finding bias present in your model.

As a nice side effect, drilling down to understand performance on subgroups is an effective way to do error analysis to debug issues and make your model better.

What doesn’t work

Don’t over-value data drift

There’s a whole category of bad advice on the internet that can be lumped together as the “drift-centric” view of ML monitoring. You’ve probably seen the blog posts – they frame the problem of ML monitoring as detecting data (or concept) drift, and then talk about all of the metrics you can use to do so. Just compute the KL divergence between training and production on all of your features, that’s all you need, right?

Wrong. Detecting data drift is not an end in itself. Good models are robust to some degree of distribution shift. Figure 1 shows a toy example that illustrates the point: our classifier performs just as well on the simulated “production” data points as on our validation points, despite obvious drift in both of its features.

Figure 1. The upper left shows the classification results of a binary classifier trained on data sampled from the same distribution as the red points. The boundary between the red and grey background regions is the classifier's decision boundary. The blue points are sampled from a significantly different distribution, yet the classifier still does a good job classifying them. The right and bottom charts show the marginal distributions of the validation and simulated "production" data points along each feature.

A more visually striking example is some work I did in my PhD on domain randomization, a method we used to train computer vision models that work in the real world despite only being trained on super low-fidelity rendered images like those in Figure 2. Our models performed well on real robots despite a large and obvious domain shift, showing that domain shift does not necessarily lead to poor performance.

Figure 2. Domain randomization is a technique used to train models that perform well in the real world despite being trained on unrealistic simulated images from a completely different data distribution.

Don’t get me wrong, big distribution shifts can impact your model’s performance, and looking at drifted features can be an effective debugging tool if you suspect there’s a problem. But it’s hard to tell a priori whether that KL divergence of 0.37 that you calculated should be considered big.

Don’t set alerts on meaningless metrics

The difficulty of picking thresholds for drift metrics brings us to a corollary: don’t set alerts on metrics that don’t matter. Among teams that I’ve spoken to, alert fatigue is one of the main reasons ML monitoring solutions lose their effectiveness. Setting alerts on noisy, hard-to-interpret metrics like the KL divergence between training and prod for a particular feature may be the biggest culprit. No one wants to be woken up in the middle of the night by a KL divergence.

While we’re on the topic of our friend the KL divergence, let’s get into the weeds a bit and talk about why everyone’s favorite not-a-metric metric isn’t the best way to measure distribution shift to begin with. As a reminder, the KL divergence between two probability distributions P and Q is defined as:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Let’s say you want to estimate the KL divergence for a feature from samples. The fraction p(x) / q(x) inside of the log means that the KL is very sensitive to values that are very small in q(x) (say 0.0001% of values) and slightly less small in p(x) (say 0.01% of values). That’s not ideal because unless you have a massive amount of data, those estimates of p(x) and q(x) are probably noisy, and they represent rare data which may not even be relevant to your model’s performance.
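
Here’s a toy illustration of that sensitivity (the numbers are made up): a bin holding a tiny sliver of the data ends up dominating the estimated KL.

```python
import numpy as np

# Two histograms over the same three bins. Bin 2 holds 0.01% of the
# "production" mass (p) but only 0.0001% of the "training" mass (q).
p = np.array([0.80, 0.1999, 1e-4])    # production
q = np.array([0.80, 0.199999, 1e-6])  # training

contributions = p * np.log(p / q)
print(contributions)        # the rare bin alone contributes ~4.6e-4 ...
print(contributions.sum())  # ... which dominates the total KL estimate
```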

The KL divergence isn’t the worst offender in the category of frequently recommended bad drift metrics, though. That dubious honor goes to two-sample statistical tests like the Kolmogorov-Smirnov (KS) test. The (faulty) logic of using such tests goes “I don’t know what threshold to apply to my distribution shift metric, so I’ll Do Science™️ and fire an alert if p < 0.05 because that’s what an old eugenicist said was ‘convenient’ in the 1920s”.

The problem is that these tests measure the wrong thing. They tell you how likely it is that your two samples (e.g., your training and production data) are drawn from the same probability distribution. If you have a lot of samples, then any tiny difference between the two distributions will result in p << 0.05. A 1% increase in the number of married people applying for loans probably won’t affect your model’s performance, but it will definitely cause your KS test alert to fire.
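
If you want to see this for yourself, here’s a small sketch using scipy’s ks_2samp: a shift far too small to matter in practice still pushes the p-value well below 0.05 once the sample gets large.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
prod = rng.normal(loc=0.01, scale=1.0, size=1_000_000)  # a practically irrelevant shift

stat, p_value = ks_2samp(train, prod)
print(stat, p_value)  # tiny KS statistic, yet p << 0.05 -> the alert fires
```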

None of this is to suggest not measuring drift or even setting alerts on it if you have reason to believe a particular feature will drift and cause your model performance to degrade. But if you do, at least choose a sensible metric (the D_1 distance from the Google data validation paper is a good choice, or the Earth mover’s distance if you’re a connoisseur). And, for the love of science, please don’t apply a p-value to it.
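
For contrast, here’s what measuring drift with the Earth mover’s distance might look like using scipy’s wasserstein_distance. It’s reported in the units of the feature itself, which makes thresholds far easier to reason about (the data below is synthetic):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=100_000)
prod = rng.normal(0.3, 1.0, size=100_000)  # feature shifted by 0.3 units

emd = wasserstein_distance(train, prod)
print(emd)  # ~0.3, i.e., the size of the shift expressed in feature units
```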

Don’t forget that ML systems are more than just models

If you saw the math in the last section and skipped right down here, then (1) hi 👋 (2) fear not, monitoring machine learning systems is not all about math. In fact, a common model monitoring mistake is focusing on the math before you have some basics in place.

Machine learning models are just one part of a broader application powered by code, data, and model together. In this review of 15 years of outages for a particular ML application at Google, the authors found that ⅔ of outages were caused by systems issues that had nothing to do with ML, and most of the ML-related issues were data problems. Your exact breakdown of failures may look different than Google’s, but monitoring data quality and service quality are prerequisites for knowing if your model is actually working well.
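
As a starting point, even a handful of explicit data-quality checks on each batch of production data goes a long way. The columns, thresholds, and check names below are hypothetical, just to show the shape of the idea:

```python
import pandas as pd

def check_batch(df: pd.DataFrame) -> dict[str, bool]:
    """Cheap sanity checks to run on every batch before scoring it."""
    return {
        "has_rows":       len(df) > 0,
        "no_missing_ids": df["user_id"].notna().all(),
        "ages_in_range":  df["account_age_days"].between(0, 20_000).all(),
        "null_rate_ok":   df.isna().mean().max() < 0.05,
    }

batch = pd.DataFrame({
    "user_id": [1, 2, 3],
    "account_age_days": [10, 250, 4000],
})
failures = [name for name, ok in check_batch(batch).items() if not ok]
print("failed checks:", failures)  # [] for this toy batch
```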

One effective heuristic is to pay attention to the events that are likely to cause changes in model performance. For example, if you’re rolling out a new product feature or expanding to a new customer base, you should expect your model performance to change. Similarly, code changes, data pipeline changes, and model deploys should prompt you to verify that your model still does what you expect.

Conclusion

Machine learning models are part of a broader application powered by code and data and designed to serve end-users. If you monitor fine-grained metrics (or good proxies) about your users, data, and associated systems, you’ll get the first step of continual improvement right: knowing that the system is working right to begin with.

Unfortunately, just knowing there’s an issue with your model isn’t very actionable - you need tools and processes to fix the issue. And unlike traditional software, in ML the tools you need – to debug your issues, fix them, validate the fix, and test the system to avoid repeating mistakes – don’t exist yet.

That’s why at Gantry we’re building a continual ML improvement platform. The goal is not only to help you detect issues that result in poor user experience, but also to help you fix them so your models get better continuously as they interact with your end users.

Book a demo if you’re interested in giving it a try!