Approach

Figure 1: schematic for MLDemon taken from the paper

How can we move from periodically labeling some data to doing so programmatically? As our model makes new predictions, rather than asking an expert for labels directly, we can instead query an algorithm called the Demon. The Demon tells us whether our model's performance is acceptable, asking the expert for labels only when it deems that necessary.

How does the Demon work? Just like in the baseline, we ask the expert for labels periodically. We also use the model’s confidence to compute an anomaly score for each of those labeled points. Then we fit a linear model that uses the anomaly score to predict the accuracy drift on new unlabeled data. As more data comes in, if the linear model is sufficiently confident that accuracy is stable, we can decrease the frequency with which we need to ask for more labels.

This confidence level can be tuned by setting the monitoring risk, which determines the risk appetite for the algorithm.
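To make the idea concrete, here is a minimal sketch of a Demon-style monitor in Python. This is not the paper's exact algorithm: the class, method names, and the way the linear fit's residual spread is compared against `monitoring_risk` are illustrative assumptions that just mirror the description above.

```python
import numpy as np


class Demon:
    """Minimal sketch of an MLDemon-style monitor (not the paper's exact algorithm).

    It periodically requests expert labels, fits a linear model mapping the
    anomaly score to measured accuracy, and stretches the query interval
    when the fit suggests accuracy is stable."""

    def __init__(self, monitoring_risk=0.1, base_interval=50):
        self.monitoring_risk = monitoring_risk  # risk appetite: lower means more querying
        self.interval = base_interval           # current labeling period (in predictions)
        self.scores, self.accuracies = [], []   # history of (anomaly score, measured accuracy)
        self.since_last_query = 0

    def should_query_expert(self):
        """Call on each new prediction; True when it is time to request expert labels."""
        self.since_last_query += 1
        if self.since_last_query >= self.interval:
            self.since_last_query = 0
            return True
        return False

    def update(self, anomaly_score, measured_accuracy):
        """Record an expert-labeled measurement and adapt the query interval."""
        self.scores.append(anomaly_score)
        self.accuracies.append(measured_accuracy)
        if len(self.scores) < 3:
            return
        # Fit accuracy ~ a * anomaly_score + b, and use the residual spread
        # as a crude proxy for how confident we are in that relationship.
        a, b = np.polyfit(self.scores, self.accuracies, deg=1)
        residuals = np.array(self.accuracies) - (a * np.array(self.scores) + b)
        if residuals.std() < self.monitoring_risk:
            self.interval = min(self.interval * 2, 1000)  # confident: query less often
        else:
            self.interval = max(self.interval // 2, 10)   # uncertain: query more often
```

In a real deployment, the anomaly score would be derived from the model's confidence on recent predictions, and `measured_accuracy` would come from the expert labels the Demon just requested.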

Findings

Figure 2: MLDemon results presented in the paper (shown here for two different monitoring risk values).

The paper assesses the performance of MLDemon against two baselines:

  • PQ (periodic querying) asks the expert for labels at a fixed interval
  • RR (request and reverify) fixes a threshold on the anomaly score and asks the expert for labels whenever that threshold is exceeded (both policies are sketched below)
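For comparison, the two baselines boil down to very simple querying rules. The parameter names below (`period`, `threshold`) are illustrative assumptions, not values from the paper.

```python
def periodic_querying(t, period=100):
    """PQ: ask the expert for labels every `period` predictions, regardless of the data."""
    return t % period == 0


def request_and_reverify(anomaly_score, threshold=0.5):
    """RR: ask the expert for labels only when the anomaly score crosses a fixed threshold."""
    return anomaly_score > threshold
```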

The authors use three different datasets and several values of the monitoring risk. Performance is assessed by the number of label queries each method requests on a dataset.

MLDemon performs at least as well as the PQ and RR methods. It outperforms RR when the anomaly detector is unreliable, because it is flexible enough to fall back to a periodic querying policy. And when MLDemon can build confidence in the data and the model, for example when changes in the data are gradual, it requires far fewer queries than the other approaches.

In short, MLDemon’s downside is limited to the better of PQ and RR, while its upside is potentially much better.

The upshot

If labeling data for the purpose of monitoring your models is cost-prohibitive, this approach might be worth trying.

However, it is strictly focused on monitoring and does not consider how the approach could be integrated with model retraining or other aspects of MLOps.

And it does not come with an accompanying repo, which would help not only with implementation but also with understanding.

You can find the paper here.