This week, let's talk about what you can do if you have more data and labels than you know what to do with 🤑

Your favorite AI breakthrough of the last 1-2 years was probably trained on a massive amount of web-scraped data. Until a few years ago, you would have needed to employ an army of annotators to get labels for all of it. Now, researchers use techniques like self-supervised learning that give us "labels" for free as we scrape.

It's nice to have lots of labeled data, but it also introduces new problems like "what subset of our data should we train on?" Naively, we'd just sample a minibatch from our corpus over and over until we expend our compute budget. In this paper, the authors ask: can we do better?

Production ML Papers to Know

This is a continuation of Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

The approach

The first key idea is online batch selection. Rather than running each training step on a random minibatch, first:

  • Sample a larger batch
  • Rank the items in the batch by a label-aware selection function (LASF)
  • Pick the top n items and train on them

So, what's the right label-aware selection function?

A baseline is to just train on the points that currently have the highest loss. This makes sense because it avoids training on already-learned datapoints. However, it will also prioritize datapoints with the wrong label (they're not learnable) and outliers (they're not worth learning).
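To make the loop concrete, here's a minimal PyTorch-style sketch of online batch selection with the high-loss baseline as the selection function. This is our own illustration, not the paper's code; names like train_with_online_batch_selection and n_select are ours.

    import torch
    import torch.nn.functional as F

    def train_with_online_batch_selection(model, optimizer, loader, n_select=32):
        """One pass of online batch selection with the high-loss baseline:
        score a large candidate batch, keep the top n_select points, train on those."""
        model.train()
        for x, y in loader:  # loader yields large candidate batches
            # Score every candidate point without tracking gradients.
            with torch.no_grad():
                scores = F.cross_entropy(model(x), y, reduction="none")

            # Rank by the selection function (here: plain training loss).
            k = min(n_select, y.shape[0])
            top_idx = scores.topk(k).indices

            # Train only on the selected points.
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x[top_idx]), y[top_idx])
            loss.backward()
            optimizer.step()

The trouble, as described above, is that ranking by raw loss keeps surfacing mislabeled points and outliers.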

To deal with these issues, the authors propose a new label-aware selection function called RHO-LOSS.
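In our notation (reconstructed from the paper's definition), RHO-LOSS is the current training loss minus the loss of a separate model trained only on holdout data:

    \mathrm{RHO\text{-}LOSS}(x, y) \;=\;
        \underbrace{\mathcal{L}\big[\, y \mid x ;\; \mathcal{D}_{\mathrm{train}} \,\big]}_{\text{training loss (the baseline)}}
      \;-\;
        \underbrace{\mathcal{L}\big[\, y \mid x ;\; \mathcal{D}_{\mathrm{holdout}} \,\big]}_{\text{irreducible holdout loss}}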

Let's break this down.

The term on the left is just the baseline we described earlier. The second term is the loss of a different model on the same datapoint. This irreducible loss (IL) model is:

  • Smaller
  • Trained separately on part of your holdout set
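Putting the two models together, the selection function is just the difference of two per-point losses. A minimal sketch in the same PyTorch style as above (il_model stands in for the paper's irreducible loss model; the names are ours):

    import torch
    import torch.nn.functional as F

    def rho_loss_scores(model, il_model, x, y):
        """RHO-LOSS for a candidate batch: training loss minus the loss
        of a small model that was trained only on holdout data."""
        with torch.no_grad():
            train_loss = F.cross_entropy(model(x), y, reduction="none")
            irreducible_loss = F.cross_entropy(il_model(x), y, reduction="none")
        return train_loss - irreducible_loss

In the loop above, you would rank candidates by rho_loss_scores(model, il_model, x, y) instead of raw training loss. Because the IL model is frozen during training, the paper notes its per-point losses can be computed once and reused.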

To build intuition about why this works, suppose the label is noisy. In that case, the training loss will be high, but the IL model will also have high loss, since it wasn't trained on this particular datapoint and can't predict the noise either. These two terms cancel, the RHO-LOSS is small, and the datapoint isn't prioritized for training. Something similar happens for outliers as well.
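As a purely illustrative example with made-up numbers (not from the paper):

    \text{mislabeled point:}\qquad\; 2.3 - 2.1 = 0.2 \;\Rightarrow\; \text{low priority} \\
    \text{clean, unlearned point:}\quad 2.0 - 0.4 = 1.6 \;\Rightarrow\; \text{high priority}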

So this new label-aware selection function selects points that are not yet learned, learnable, and worth learning.

Experiments

The marquee chart in the paper shows that on the Clothing-1M dataset, this technique works quite well, giving an 18x speedup over uniform random sampling.

The other takeaway from the experimental results is that you can get away with training just one tiny irreducible loss (IL) model (e.g., a small CNN) that can be reused across all of your researchers and all of your experiments, making this process a lot more feasible.

The upshot

If you're training big self-supervised models, this is worth checking out.

A few caveats to their approach:

  • Their experiments are small-scale. Even Clothing-1M is only a million data points, which is smaller than the scale where these results would be most useful
  • Ignoring outliers may not be the best idea in practice. Depending on your task, good aggregate performance may not be the right goal: it may be critical to build a model that performs well on outliers
  • The computational overhead of this technique only makes sense if your training is highly parallelized and you use a few more tricks. Check out the paper for more details

Check out the paper here: https://arxiv.org/abs/2206.07137