MLDemon: cheaper monitoring of production models

Part of an ongoing series highlighting insights from papers that have contributed to the development of best practices for production ML

Josh Tobin
October 27, 2022

Say you’ve deployed a model and want to see how it’s performing in production. As we’ve discussed in the past, the commonly recommended approach of detecting data drift can be misleading: drift isn’t a reliable predictor of how your model is performing.

Manually assessing model performance by periodically labeling data is often a better idea.

However, this approach is expensive and time-consuming. This paper proposes a way to reduce that cost by deciding more intelligently when labeling new data is worthwhile.

Production ML Papers to Know

This is a continuation of Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception we decided to turn these into blog posts.

MLDemon: cheaper monitoring of production models


Figure 1: schematic for MLDemon taken from the paper

How can we move from periodically labeling some data to doing so programmatically? As our model makes new predictions, rather than asking an expert for labels directly, we can instead query an algorithm called the Demon. The Demon tells us whether our model performance is acceptable, only asking the expert for labels if it deems it necessary.

How does the Demon work? Just like in the baseline, we ask the expert for labels periodically. We also use the model’s confidence to compute an anomaly score for each of those labeled points. Then we fit a linear model that uses the anomaly score to predict the accuracy drift on new unlabeled data. As more data comes in, if the linear model is sufficiently confident that accuracy is stable, we can decrease the frequency with which we need to ask for more labels.
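To make the description above concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the anomaly score (one minus the top-class confidence), the per-batch averaging, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def anomaly_score(probs):
    """Assumed score: 1 minus the model's top-class confidence per example."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs.max(axis=-1)

def fit_drift_predictor(batch_scores, batch_accuracies):
    """Fit a least-squares line from each labeled batch's mean anomaly
    score to the accuracy the expert labels revealed for that batch."""
    slope, intercept = np.polyfit(batch_scores, batch_accuracies, deg=1)
    return slope, intercept

def predict_accuracy(slope, intercept, new_probs):
    """Estimate accuracy on unlabeled data from its mean anomaly score."""
    score = anomaly_score(new_probs).mean()
    return float(np.clip(slope * score + intercept, 0.0, 1.0))

# Hypothetical history: three labeled batches (mean score, measured accuracy).
slope, intercept = fit_drift_predictor([0.1, 0.2, 0.3], [0.9, 0.8, 0.7])

# New unlabeled predictions: class probabilities from the deployed model.
estimate = predict_accuracy(slope, intercept, [[0.8, 0.2], [0.8, 0.2]])
```

The accuracy drift is then the gap between this estimate and the accuracy measured at deployment time. A real implementation would also need an uncertainty estimate for the fit, which this sketch leaves out.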

This confidence level can be tuned by setting the monitoring risk, which determines the risk appetite for the algorithm.
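One way to picture the role of this parameter is a rule that widens the gap between label requests only while the estimated drift stays within a tolerance. The sketch below is an illustrative stand-in, not the paper's policy: `tolerance` plays the part of the monitoring risk, and the doubling rule, the `uncertainty` margin, and the interval bounds are all assumptions.

```python
def next_query_interval(current_interval, predicted_drift, uncertainty,
                        tolerance, min_interval=1, max_interval=64):
    """Illustrative rule: if the predicted drift plus an uncertainty margin
    is within tolerance, double the gap between label requests (capped at
    max_interval); otherwise fall back to the shortest interval."""
    if abs(predicted_drift) + uncertainty <= tolerance:
        return min(current_interval * 2, max_interval)
    return min_interval

# Stable model: small drift estimate, so labels are requested less often.
relaxed = next_query_interval(4, predicted_drift=0.01, uncertainty=0.01, tolerance=0.05)

# Drift estimate too uncertain for the chosen tolerance: query again soon.
cautious = next_query_interval(4, predicted_drift=0.04, uncertainty=0.02, tolerance=0.05)
```

A larger tolerance lets the interval grow more readily, which saves labels at the cost of reacting more slowly to a real drop in accuracy.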


Figure 2: MLDemon results presented in the paper (shown here for two different monitoring risk values).

The paper assesses the performance of MLDemon against two baselines:

  • PQ (periodic querying) asks the expert for labels on a fixed period
  • RR (request and reverify) fixes a threshold on the anomaly score and asks the expert for labels whenever the score exceeds that threshold
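The two baselines can be sketched as simple query policies. This is a simplified reading of the descriptions above, with hypothetical function names; each function returns the time steps at which the expert would be asked for labels.

```python
def periodic_querying(num_steps, period):
    """PQ: request labels at every `period`-th step, regardless of the data."""
    return [t for t in range(num_steps) if t % period == 0]

def request_and_reverify(anomaly_scores, threshold):
    """RR: request labels at each step whose anomaly score exceeds a fixed threshold."""
    return [t for t, score in enumerate(anomaly_scores) if score > threshold]

pq_queries = periodic_querying(num_steps=10, period=4)
rr_queries = request_and_reverify([0.1, 0.6, 0.2, 0.9], threshold=0.5)

# The length of each list is that policy's label-query count.
pq_cost, rr_cost = len(pq_queries), len(rr_queries)
```

PQ's cost is fixed in advance by the period, while RR's cost depends entirely on how well the anomaly score tracks real problems.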

The authors use three datasets and different values of the monitoring risk. Performance is measured by the number of label queries each method requests on a dataset; fewer queries is better.

MLDemon performs at least as well as both PQ and RR. It outperforms RR when the anomaly detector is unreliable, because it is flexible enough to fall back on a periodic querying policy. And when MLDemon is able to build confidence in the data and the model (for example, when the data changes gradually), it requires far fewer label queries than the other approaches.

In short, MLDemon’s downside is limited to whichever of PQ and RR performs better, while the upside is potentially much larger.

The upshot

If labeling data for the purpose of monitoring your models is cost-prohibitive, this approach might be worth trying.

However, it is strictly focused on monitoring and does not consider how the approach could be integrated into model retraining or other aspects of MLOps.

It also does not come with an accompanying repo, which would help not just with implementation but also with understanding.

You can find the paper here.