THE GANTRY BLOG

Can we do better than "drift detection"?

Part of an ongoing series highlighting insights from papers that have contributed to the development of best practices for production ML

Josh Tobin
September 15, 2022

As we've discussed in the past, "detecting drift" is usually the wrong framing for monitoring ML models in production.

Placing too much importance on "data drift" is one example of the bad advice you'll often hear about model monitoring on the internet. Drift can hurt your models, but it's not guaranteed to. There's no way to know whether a KL divergence of 0.17 will have a big impact on performance.
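For concreteness, "drift" is typically quantified as a divergence between a reference distribution (say, a binned histogram of a feature at training time) and the same histogram in production. A minimal sketch of the KL divergence computation that number comes from (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, e.g. binned feature histograms.

    A small eps avoids log(0) when a bin is empty in either histogram.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Identical histograms -> divergence near zero; shifted ones -> positive.
reference = [0.25, 0.25, 0.25, 0.25]
production = [0.40, 0.30, 0.20, 0.10]
print(kl_divergence(reference, production))
```

The point of the paragraph above is that this number, whatever its value, doesn't tell you how much your model's accuracy changed, only that the inputs moved.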

Nonetheless, measuring drift can be helpful because we don't always have labels or human feedback to look at instead.

But what if there were a different quantity to measure, still computed only from unlabeled data, that would give us a more actionable sense of how our model is doing?

Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

In Leveraging Unlabeled Data to Predict Out-of-Distribution Performance, the authors propose one way to do exactly that. They explore directly approximating the accuracy of the model on an out-of-distribution sample.

Their technique, called Average Threshold Confidence (ATC), works by choosing a threshold on the model's confidence scores. The threshold is chosen on labeled in-distribution data so that the fraction of points with confidence above it matches the model's accuracy: if your model has 90% accuracy, you choose the 10th-percentile confidence value on your dataset. The model's accuracy on out-of-distribution data is then estimated as the fraction of (unlabeled) out-of-distribution points whose confidence exceeds that threshold.
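The threshold rule above fits in a few lines of NumPy. This is a sketch rather than the authors' reference implementation; it assumes "confidence" means something like the max-softmax score (the paper also considers other score functions), and the function names are ours:

```python
import numpy as np

def atc_threshold(val_confidences, val_accuracy):
    """Choose a threshold t so that the fraction of in-distribution
    (e.g. validation) points with confidence above t equals the
    model's in-distribution accuracy."""
    # 90% accuracy -> want 90% of points above t,
    # so t is the 10th-percentile confidence value.
    return np.quantile(val_confidences, 1.0 - val_accuracy)

def atc_estimate(ood_confidences, threshold):
    """Estimated out-of-distribution accuracy: the fraction of
    unlabeled OOD points whose confidence exceeds the threshold."""
    return float(np.mean(np.asarray(ood_confidences) > threshold))

# Usage: calibrate the threshold in-distribution, then apply it
# to unlabeled production confidences.
t = atc_threshold(np.linspace(0.0, 1.0, 101), val_accuracy=0.9)
estimated_accuracy = atc_estimate([0.05, 0.5, 0.9], t)
```

Note that no production labels appear anywhere in the second step, which is the whole appeal: the only inputs are the model's own confidence scores.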


Results

This super-simple technique works surprisingly well across a range of datasets and types of shifts!


One big limitation is that it only works for classification metrics today. More research is needed to see whether it can be extended to regression metrics.

Another limitation applies more generally to all methods of estimating accuracy on out-of-distribution data.

The authors show that any accuracy estimation technique will fail for some categories of distribution shift, unless you make assumptions about what the shift will look like.

This is an unfortunate but probably not surprising result: if you only look at the model's input data, for example, there's no way to know whether your label distribution is changing.

The upshot

Nonetheless, in our interview, first author Saurabh Garg paints a compelling vision for a future in which we have

  • A good understanding of what types of shifts occur in practice
  • An array of simple, effective heuristics to detect changes in model performance caused by these shifts

If you're looking for a way to monitor signals that are more meaningful than "drift", without access to labels, this might be worth a try (with the caveats mentioned above!).

If you want to learn more, check out the paper: https://arxiv.org/abs/2201.04234

Or our interview with the first author Saurabh Garg.