Active Surrogate Estimators: How many labels do you really need to approximate model performance?

Part of an ongoing series highlighting insights from papers that have contributed to the development of best practices for production ML

Josh Tobin
November 10, 2022

Say you deploy your model in a new setting and want to measure how accurate it is in the new domain.

Naively, you could randomly sample data points from the new domain, label them, and compute accuracy on the labeled data. But in many real-world settings, labels are expensive and hard to acquire. How can you get the most bang for your buck, and estimate accuracy efficiently with a small number of labels?
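As a point of reference, the naive baseline can be sketched in a few lines. Everything here (`model_predict`, `pool_x`, `label_fn`) is a hypothetical stand-in for your deployed model, the unlabeled pool from the new domain, and your expensive labeling process:

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_accuracy_estimate(model_predict, pool_x, label_fn, n_labels=100):
    """Estimate accuracy by labeling a uniform random sample of the pool.

    `model_predict`, `pool_x`, and `label_fn` are hypothetical stand-ins
    for the deployed model, the unlabeled data from the new domain, and
    the (expensive) labeling process.
    """
    # Uniform sample without replacement from the unlabeled pool
    idx = rng.choice(len(pool_x), size=n_labels, replace=False)
    preds = model_predict(pool_x[idx])
    labels = np.array([label_fn(x) for x in pool_x[idx]])
    return float(np.mean(preds == labels))
```

This is unbiased but wasteful: every label costs the same, whether or not it tells you anything new about where the model fails.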

This paper proposes a label-efficient way to measure model performance using active learning called Active Surrogate Estimators (ASEs).

Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.


The key idea of this paper is that, rather than labeling random points from the new domain, you should label a more “interesting” sample of the data. Suppose your model is highly confident on a large fraction of the data in the new domain. Intuitively, if you trust the model’s confidence, the high-confidence data will be less informative to label because the model will rarely be wrong on it. Lower-confidence data, on the other hand, will be more informative because the model will sometimes be incorrect.

The method introduced here addresses three technical challenges with the intuitive approach.

Dealing with sampling bias

First, how do we know how the model’s performance on “interesting” datapoints translates to performance on the overall data distribution?

The approach here, called LURE (Levelled Unbiased Risk Estimator), was introduced in an earlier paper. In short, it extends importance sampling to the setting where, instead of sampling from a data distribution, you have a fixed pool of data and therefore need to sample without replacement.

$$\hat{R}_{\mathrm{LURE}} = \frac{1}{M}\sum_{m=1}^{M} v_m\,\mathcal{L}\big(f(x_{i_m}),\,y_{i_m}\big), \qquad v_m = 1 + \frac{N-M}{N-m}\left(\frac{1}{(N-m+1)\,q(i_m)} - 1\right)$$
where q(i_m) is the probability of labeling sample i_m at iteration m, N is the size of the unlabeled pool, and M is the total number of labels collected.

Instead of taking the average loss across all labeled datapoints, we take a weighted average. The weights v_m are smaller for datapoints that had a higher chance of getting labeled, whether because q(i_m) was large or because they were chosen from a smaller remaining pool. This weighting removes the bias that non-uniform sampling would otherwise introduce into the estimate.
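To make the weighting concrete, here is a minimal sketch of the LURE weights and the resulting estimate. The function names are mine, not the paper's; the weight formula is the standard LURE expression:

```python
import numpy as np

def lure_weights(q_probs, N):
    """LURE importance weights for a sequence of actively acquired labels.

    q_probs[m-1] is q(i_m), the probability the acquisition distribution
    assigned to the point labeled at iteration m; N is the size of the
    unlabeled pool. Names here are illustrative, not from the paper.
    """
    M = len(q_probs)
    v = np.empty(M)
    for m, q in enumerate(q_probs, start=1):
        # Weight shrinks as q grows and as the remaining pool shrinks
        v[m - 1] = 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q) - 1.0)
    return v

def lure_estimate(losses, q_probs, N):
    """Weighted average of per-point losses using the LURE weights."""
    return float(np.mean(lure_weights(q_probs, N) * np.asarray(losses)))
```

A useful sanity check: under uniform sampling without replacement, q(i_m) = 1/(N − m + 1), every weight collapses to 1, and the estimate reduces to the plain sample mean.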

Choosing which datapoints to label

In the naive approach described above, we trust the model’s confidence, so we choose datapoints to label based on that measure. However, in the real world, confidence is unreliable: models often produce highly confident but incorrect predictions on out-of-distribution data.

To correct this issue, ASEs train a second model: the surrogate. Unlike the primary model, the surrogate needs to come from a model class with calibrated uncertainty estimates, such as Bayesian neural networks or deep ensembles. Then, we choose datapoints that score highly according to the XWED acquisition function:


The first term in the XWED function selects points where the surrogate model’s estimate of the label distribution differs from that of the base model. The second term de-emphasizes points where the estimates differ only because of noise in parameter space. Intuitively, the goal is to trade off sampling points that provide the best estimate of the loss against points that will most improve the surrogate model in future iterations of the active learning algorithm.
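To make the two ingredients tangible, here is a simplified, disagreement-based acquisition score. This is emphatically not the paper's exact XWED formula (see the paper for that); it is a sketch of the two effects described above, with a deep-ensemble surrogate and a BALD-style mutual-information term standing in for the parameter-noise penalty:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy along the last (class) axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def acquisition_scores(base_probs, member_probs, eps=1e-12):
    """Simplified disagreement-based acquisition in the spirit of XWED.

    NOT the paper's exact formula -- a hedged sketch of its two ingredients:
      * disagreement: cross-entropy between the surrogate ensemble's mean
        prediction and the base model's prediction;
      * a penalty for disagreement explained by parameter noise,
        approximated here by the ensemble's epistemic uncertainty
        (BALD-style mutual information).

    base_probs:   (n_points, n_classes) base model class probabilities
    member_probs: (n_members, n_points, n_classes) surrogate ensemble probs
    """
    mean_surrogate = member_probs.mean(axis=0)
    # High when the surrogate consensus disagrees with the base model
    disagreement = -np.sum(mean_surrogate * np.log(base_probs + eps), axis=-1)
    # High when ensemble members disagree with each other (parameter noise)
    epistemic = entropy(mean_surrogate) - entropy(member_probs).mean(axis=0)
    return disagreement - epistemic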

Iteratively improving sampling through active learning

Rather than sampling all datapoints to label using a fixed acquisition strategy, ASEs instead iterate between:

  • Sampling new data to label using the XWED acquisition function
  • Retraining the surrogate model (and hence updating the acquisition function) on the newly labeled data

The main advantage of this approach is that, each time we sample data points to label, we are using the most up-to-date information about the model’s performance to select the most informative datapoints.
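Putting the pieces together, the outer loop might look like the sketch below. All function arguments (`label_fn`, `loss_fn`, `score_fn`, `retrain_fn`) are hypothetical placeholders for your labeling process, loss, surrogate-based acquisition, and surrogate update; the weight formula is the standard LURE expression:

```python
import numpy as np

rng = np.random.default_rng(0)

def active_evaluation(pool_x, label_fn, loss_fn, score_fn, retrain_fn, M):
    """Hedged sketch of an ASE-style evaluation loop (names are mine).

    Each iteration: score the remaining pool with the current surrogate,
    sample one point in proportion to the scores, label it, retrain the
    surrogate, and finally combine the losses with LURE weights.
    """
    N = len(pool_x)
    remaining = list(range(N))
    losses, q_probs = [], []
    for _ in range(M):
        scores = np.clip(score_fn(pool_x[remaining]), 1e-12, None)
        q = scores / scores.sum()          # acquisition distribution
        j = rng.choice(len(remaining), p=q)
        i = remaining.pop(j)
        losses.append(loss_fn(i, label_fn(pool_x[i])))
        q_probs.append(q[j])
        retrain_fn(i)                      # surrogate sees the new label
    # LURE importance weights (Farquhar et al.)
    v = [1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q) - 1.0)
         for m, q in enumerate(q_probs, start=1)]
    return float(np.mean(np.array(v) * np.array(losses)))
```

With a constant `score_fn`, the loop degenerates to uniform sampling without replacement and the weighted estimate matches the plain sample mean, which is a handy correctness check.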

This strategy also helps on out-of-distribution data: unlike the primary model, the surrogate model can be retrained to adapt to the new distribution.


In the paper’s experiments, this method improves the sample efficiency of reaching a desired approximation error by roughly an order of magnitude, even on out-of-distribution data.

The upshot

If you’re label-constrained and want to estimate accuracy more efficiently, this method seems promising.

However, there are a few caveats worth noting:

  • You need to be able to train (and retrain) an uncertainty-aware neural network, and you need to be OK with the cost of running that network for a second prediction over your data
  • While the authors have strong results on out-of-distribution data, they did not test the method in an online setting with gradual distribution shift. I suspect it would work well there too with some small modifications

Lastly, I’d love to see work that does super-efficient model monitoring by combining this strategy for sampling data for performance estimation with MLDemon, which achieves better label efficiency by deciding when new labels are worthwhile at all.

Check out the paper here: