If you’re reading this post, it’s probably not news to you that machine learning models can perform poorly on out-of-distribution data.
Detecting and mitigating these degradations can be expensive, because doing so often requires labeled data.
Tent: Fully Test-Time Adaptation by Entropy Minimization proposes a way to deal with distribution shift without labels and, as a bonus, without needing to keep the training data around either.
Production ML Papers to Know
Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.
What’s Test-Time Adaptation?
Tent proposes adapting to distribution shift purely at test time - that is, without relying on the source data or on ground truth labels to update a model. Source data and labels are often required by alternative approaches such as train-time adaptation or domain adaptation, but they can be difficult to acquire.
The core idea behind Tent is that an adaptation objective - entropy - can serve as a proxy metric for model performance and can be minimized to update the normalization layers of the model. For this to work, the model needs to be probabilistic, differentiable, and trained on a supervised task, which describes many typical deep neural networks.
By targeting only the normalization layers for updating, which make up less than 1% of a model's parameters, the approach is much more efficient than updating all the model’s parameters.
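To get a feel for why the normalization layers are such a small share of a model, here is a rough back-of-the-envelope count for a single convolution followed by batch norm. The layer sizes (256 channels, 3x3 kernel) and the helper name `conv_bn_param_counts` are assumptions chosen for illustration; they are not taken from the paper.

```python
def conv_bn_param_counts(in_channels, out_channels, kernel_size):
    """Count parameters in a bias-free 2D convolution and the batch norm layer after it."""
    conv_weights = in_channels * out_channels * kernel_size * kernel_size
    # Batch norm learns one scale and one shift per output channel.
    bn_affine = 2 * out_channels
    return conv_weights, bn_affine


# Hypothetical layer sizes, picked only to illustrate the ratio.
conv, bn = conv_bn_param_counts(in_channels=256, out_channels=256, kernel_size=3)
fraction = bn / (conv + bn)
print(f"conv weights: {conv}, batch norm scale/shift params: {bn}")
print(f"share of parameters in batch norm: {fraction:.4%}")
```

For this one hypothetical block, the scale and shift parameters come to well under 0.1% of the total. The exact share for a full network depends on its architecture.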
Why use entropy as an objective function? As the charts below from the paper show, predictions with lower entropy have lower error rates. More confident predictions are more likely to be correct, so by reducing entropy, the paper argues that we reduce error too.
How does it work?
Tent starts by collecting the transformation parameters for each normalization layer and channel in the source model. It calculates the entropy - $H(\hat{y})$ - of the batch predictions using Shannon entropy: $H(\hat{y}) = -\sum_c p(\hat{y}_c) \log p(\hat{y}_c)$ for the probability $\hat{y}_c$ of class $c$.
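As a minimal sketch of the entropy calculation, the snippet below computes Shannon entropy for a single prediction in plain Python. The helper names `softmax` and `shannon_entropy` and the example logits are assumptions for illustration, not code from the paper.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """H = -sum_c p_c * log(p_c), skipping classes with zero probability."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: one confident prediction and one maximally uncertain one.
confident = softmax([8.0, 0.5, 0.1])
uncertain = softmax([1.0, 1.0, 1.0])

print("confident:", shannon_entropy(confident))
print("uncertain:", shannon_entropy(uncertain))  # equals log(3) for a uniform 3-class prediction
```

The confident prediction has entropy close to zero, while the uniform one sits at the maximum for three classes. This is the quantity Tent tries to push down.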
Tent then updates the normalization statistics and transformation parameters for each channel of the source model on a per-batch basis. The normalization statistics used by the source model are discarded and re-estimated from the test data for each norm layer during the forward pass.
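The forward-pass part of this can be sketched for a single channel as follows. This is a simplified plain-Python illustration, not the paper's implementation: the function name `normalize_channel`, the `eps` value, and the sample activations are all assumptions.

```python
def normalize_channel(values, gamma, beta, eps=1e-5):
    """Normalize one channel with statistics from the current batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    # Population (biased) variance of the batch.
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]


# Hypothetical activations for one channel in a shifted test batch.
shifted_batch = [10.2, 11.9, 9.4, 12.5]
normalized = normalize_channel(shifted_batch, gamma=1.0, beta=0.0)

batch_mean = sum(normalized) / len(normalized)
batch_var = sum((v - batch_mean) ** 2 for v in normalized) / len(normalized)
print("mean after normalization:", batch_mean)
print("variance after normalization:", batch_var)
```

Because the mean and variance come from the test batch itself rather than from stored training statistics, the output is re-centered and re-scaled even when the inputs have drifted.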
The transformation parameters are updated during the backward pass using the gradient of the prediction entropy - $\nabla H(\hat{y})$. The figure below, taken from the paper, provides an overview of the method.
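To make the backward-pass update concrete, here is a toy sketch in plain Python. It assumes a deliberately tiny setup that is not from the paper: the normalization layer is the last layer, so each logit is `gamma_c * x_hat_c + beta_c`, where `gamma` and `beta` stand for the per-channel scale and shift. The gradient is written out by hand rather than computed by an autodiff library. All names, the data, and the learning rate are assumptions for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(x_hat, gamma, beta):
    """Average prediction entropy over the batch for logits gamma_c * x_hat_c + beta_c."""
    total = 0.0
    for row in x_hat:
        logits = [g * x + b for g, x, b in zip(gamma, row, beta)]
        total += entropy(softmax(logits))
    return total / len(x_hat)


def entropy_step(x_hat, gamma, beta, lr):
    """One gradient descent step on gamma and beta that lowers the mean entropy."""
    n, c = len(x_hat), len(gamma)
    grad_gamma, grad_beta = [0.0] * c, [0.0] * c
    for row in x_hat:
        probs = softmax([g * x + b for g, x, b in zip(gamma, row, beta)])
        h = entropy(probs)
        for k in range(c):
            if probs[k] <= 0.0:
                continue  # a zero-probability class contributes no gradient
            # For softmax outputs, dH/dz_k = -p_k * (log p_k + H).
            dz = -probs[k] * (math.log(probs[k]) + h)
            grad_gamma[k] += dz * row[k] / n
            grad_beta[k] += dz / n
    new_gamma = [g - lr * gg for g, gg in zip(gamma, grad_gamma)]
    new_beta = [b - lr * gb for b, gb in zip(beta, grad_beta)]
    return new_gamma, new_beta


# Hypothetical normalized features: 4 samples, 3 channels (one per class).
x_hat = [[1.2, -0.3, -0.9], [-0.5, 0.8, -0.3], [0.1, -0.2, 0.1], [-0.8, -0.3, 1.1]]
gamma, beta = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

before = mean_entropy(x_hat, gamma, beta)
for _ in range(50):
    gamma, beta = entropy_step(x_hat, gamma, beta, lr=0.1)
after = mean_entropy(x_hat, gamma, beta)
print("mean entropy before:", before)
print("mean entropy after:", after)
```

Only `gamma` and `beta` change here; the features themselves are untouched, which mirrors the idea of updating only the normalization layers. In a real network the gradient would be computed by backpropagation through every layer after each norm layer, not by the closed-form expression used in this sketch.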
How effective is it?
In the paper, Tent is evaluated against alternative approaches (such as domain adaptation, self-supervision and normalization) on benchmark datasets, using ResNet-based models with batch norm as the normalization layers. The authors report that it reduces error and requires less computation across corruption robustness and domain adaptation settings.
If you’re looking for a way to adapt your model online without labels and without retraining, Tent could be worth a try.
It could also be combined with other approaches that aim to reduce relabeling and retraining requirements, as part of an overall monitoring strategy that adapts to new data, provides confidence in the model, and reduces time and cost.
Check out the paper here: https://arxiv.org/pdf/2006.10726.pdf