If you want a single machine learning model that can solve a variety of image classification tasks, you might look to an open-vocabulary model like CLIP.
CLIP achieves near-state-of-the-art zero-shot performance on certain classification tasks, but not on all of them (it struggles even on MNIST). Ideally, we’d be able to use a small amount of data to adapt the model to new tasks as we encounter them. But naively fine-tuning it to improve performance on a new task leads to performance degradation on older ones (the “catastrophic forgetting” problem).
Today’s paper proposes a solution, based on the idea of model patching.
Production ML Papers to Know
Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.
We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.
Patching open-vocabulary models by interpolating weights
Painting with Interpolation
The goal of patching is to update the weights of your model so that they are better suited to the new task, while retaining performance on the original tasks.
The paper introduces a patching method called Patching with Interpolation (PAINT), and links to a repo with a helpful Python implementation:
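The core step amounts to a weighted average of the two models’ weights. Here is a minimal sketch, assuming PyTorch-style state dicts; the function and variable names are ours rather than the repo’s exact code:

```python
import copy
import torch
import torch.nn as nn

def paint_interpolate(zero_shot_model: nn.Module, fine_tuned_model: nn.Module, alpha: float) -> nn.Module:
    """Blend weights: theta_patched = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned."""
    theta_zs = zero_shot_model.state_dict()
    theta_ft = fine_tuned_model.state_dict()
    theta = {k: (1 - alpha) * theta_zs[k] + alpha * theta_ft[k] for k in theta_zs}
    patched = copy.deepcopy(zero_shot_model)
    patched.load_state_dict(theta)
    return patched

# Toy demonstration with a stand-in model (in practice, a CLIP image encoder).
zero_shot = nn.Linear(4, 2)
fine_tuned = copy.deepcopy(zero_shot)
with torch.no_grad():
    for p in fine_tuned.parameters():
        p.add_(0.1)  # pretend fine-tuning shifted the weights
patched = paint_interpolate(zero_shot, fine_tuned, alpha=0.5)
```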
To summarize, PAINT fine-tunes the model on the new task as usual. But rather than using the fine-tuned weights directly, it uses a linear interpolation between those weights and the original ones. The interpolation coefficient alpha is chosen by cross-validation.
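The choice of alpha is what controls the trade-off: too small and the patch task barely improves, too large and supported-task accuracy drops. A hypothetical sweep over candidate values, reusing `paint_interpolate` from the sketch above and assuming `eval_patch` / `eval_supported` callables that return held-out accuracy (the paper’s exact selection criterion may differ), might look like:

```python
def choose_alpha(zero_shot, fine_tuned, eval_patch, eval_supported,
                 candidates=tuple(i / 10 for i in range(11))):
    """Pick the mixing coefficient that best balances new-task and supported-task accuracy.
    Summing the two accuracies is one simple criterion, used here for illustration."""
    best_alpha, best_score = None, float("-inf")
    for alpha in candidates:
        patched = paint_interpolate(zero_shot, fine_tuned, alpha)
        score = eval_patch(patched) + eval_supported(patched)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```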
This process is for patching on a single task. The paper provides three ways to patch on multiple tasks: joint patching, where all patching tasks are merged into a single task before the above procedure is run; sequential patching, where the patching procedure is applied to each new task in turn; and parallel patching, where the fine-tuning step is run separately for each task and the resulting weights are then combined with the original ones.
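Of these, sequential patching is the most direct extension of the single-task procedure. A hedged sketch, as we read it (each round starts from the previously patched weights), where `finetune_on` and `choose_alpha_for` are stand-ins for your own fine-tuning and alpha-selection code, such as the sweep above:

```python
def sequential_patch(model, tasks, finetune_on, choose_alpha_for):
    """Apply the single-task patching procedure task by task,
    starting each round from the previously patched weights."""
    for task in tasks:
        fine_tuned = finetune_on(model, task)              # fine-tune from the current weights
        alpha = choose_alpha_for(model, fine_tuned, task)  # pick the mixing coefficient
        model = paint_interpolate(model, fine_tuned, alpha)
    return model
```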
Larger models are easier to patch
The authors tested PAINT on a range of image classification tasks, split into supported tasks (on which the model - typically a CLIP pre-trained Vision Transformer (ViT) - already performs well and where accuracy should be preserved) and patching tasks (on which a zero-shot CLIP model performs poorly compared to a specialized model).
Performance for patching models on a single task is summarized in the chart below.
On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model.
PAINT works better with larger models. They come closer to the accuracy of specialized models than smaller models do (left chart), require less interpolation to fit new data (middle chart), and show higher cosine similarity between the weights of the unpatched and fine-tuned models (right chart).
Performance for patching models on multiple tasks is summarized in the chart below, which shows model accuracies for two different ViT models, patched using the different methods outlined above, across a range of tasks.
A single CLIP model, patched on nine image classification tasks, is “competitive” with specialized models for each task. Joint patching is the best-performing method on average, with parallel patching the worst-performing.
The paper also demonstrates how PAINT enables broad transfer. A ViT patched on one half of a dataset improves its accuracy on the other half, even though the two halves contain disjoint classes.
The Upshot
At Continual Learnings, we love a simple technique that works well. PAINT appears to be one, though there are some clear limitations (for example, accuracy on old tasks can still decrease, especially for smaller models).
The paper also describes applications of PAINT beyond the experiments covered above, such as patching CLIP models against typographic attacks (where text superimposed on an image leads to misclassification).
If you work with open-vocabulary models, or are interested more generally in how models can be adapted to new tasks without retraining, then this paper is well worth checking out.
You can find the paper here.