Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Those write-ups were well received, so we decided to turn them into blog posts.

Why CTR prediction is hard

Click-through rate (CTR) prediction is valuable because it’s a primary signal of the usefulness of ads. It feeds directly into the cost per click that advertisers pay.

Google’s CTR prediction model “consists of billions of weights, trains on more than one hundred billion examples, and is required to perform inference at well over one hundred thousand requests per second.” This isn’t a set-and-forget model either. Google is constantly trying to improve its performance without adding training / serving costs or undue complexity.

The paper covers techniques Google uses to improve accuracy, efficiency, reproducibility, calibration, and credit attribution.

We’ll cover their approach here, much of which applies to smaller-scale systems too. But first, we’ll describe the model itself.

Model architecture

The paper does not describe the full model architecture, but it reveals some interesting details.

First, the Google team found that the text of the query and ad headlines is critical context for the model, but, for performance reasons, they forgo representing it with an LLM in favor of a smaller model that uses classical text features like n-grams.

Beyond that, the baseline model is pretty standard — the remaining features are embedded, the embeddings are concatenated, and the model is trained using AdaGrad, log loss, and ReLUs. Google CTR engineers are just like you and me.
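
For intuition, here is a minimal sketch of what such a baseline might look like in PyTorch. The layer sizes, vocabulary sizes, and learning rate are placeholders, not details from the paper:

```python
import torch
import torch.nn as nn

class BaselineCTR(nn.Module):
    """Sketch of the baseline: embed each categorical feature, concatenate, feed a ReLU MLP."""
    def __init__(self, vocab_sizes, embed_dim=32, hidden=(256, 128)):
        super().__init__()
        # One embedding table per categorical feature (placeholder vocab sizes).
        self.embeddings = nn.ModuleList(nn.Embedding(v, embed_dim) for v in vocab_sizes)
        dims = [embed_dim * len(vocab_sizes), *hidden]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.mlp = nn.Sequential(*layers, nn.Linear(dims[-1], 1))

    def forward(self, x):
        # x: (batch, num_features) tensor of categorical feature ids.
        embs = [emb(x[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embs, dim=-1)).squeeze(-1)  # click logits

model = BaselineCTR(vocab_sizes=[1000, 500, 200])           # placeholder vocab sizes
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()                             # log loss on click labels
```
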

The rest of the paper describes ways they improve on this baseline.

Reducing costs through efficiency

As committed as Google is to ML, even for them any gain from ML needs to be weighed against cost: not just cost of training, but also “long-term cost to future R&D.” This frequently leads to killing ideas that improve performance but are deemed not worth the cost.

So, a parallel aim to improving accuracy is improving efficiency. In practice, candidate models are judged on two questions: 1) does accuracy go up while training cost stays flat? and 2) does training get cheaper when model capacity is lowered until accuracy is neutral?

Here are some techniques Google uses to improve efficiency.

  • Bottlenecks. Neural networks with wider layers are more accurate but slower and more costly. The Google team found that you can use bottleneck layers (low-rank matrices) to get some of the benefits of larger layers without a massive cost increase (see the sketch after this list).
  • AutoML. CTR models benefit from tuning tons of hyperparameters: embedding widths, layer widths, etc. That’s what AutoML does well, but standard AutoML isn’t cost-effective. Instead, the team uses a variant of neural architecture search (NAS) based on weight sharing, tuned to evaluate candidates against different cost constraints (e.g., cost at no more than 85%/90%/95% of the baseline cost). Recently, this AutoML technique reduced the time per training step by 16% without reducing accuracy.
  • Data sampling. The CTR model exhibits diminishing returns in performance as the training set grows, so sampling is another effective cost-reduction measure. The team improves on random sampling in a few ways: (1) restricting to more recent data, which is more relevant for CTR prediction; (2) oversampling “clicked” examples, which are rarer and more important, while undersampling “non-clicked” ones; and (3) sub-sampling examples with low log-loss and ads that were unlikely to have been seen by the user. A sketch of the rebalancing idea follows this list.
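
To make the bottleneck idea concrete, here is a rough sketch: replace one wide dense layer with a down-projection followed by an up-projection, i.e. a low-rank factorization. Dimensions are illustrative, not Google’s:

```python
import torch.nn as nn

def wide_layer(d_in, d_out):
    # Standard dense layer: d_in * d_out weights.
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

def bottleneck_layer(d_in, d_out, rank):
    # Low-rank factorization: d_in * rank + rank * d_out weights,
    # far fewer when rank << min(d_in, d_out).
    return nn.Sequential(
        nn.Linear(d_in, rank, bias=False),  # project down
        nn.Linear(rank, d_out),             # project back up
        nn.ReLU(),
    )

full = wide_layer(1024, 1024)              # ~1.05M weights
cheap = bottleneck_layer(1024, 1024, 128)  # ~0.26M weights
```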

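And here is a sketch of the class-rebalancing part of the sampling strategy. One detail worth flagging: when you keep only a fraction of non-clicked examples, it is standard to re-weight the retained ones by the inverse of their sampling rate so the loss (and calibration) stays unbiased. The rates below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_for_training(clicked, keep_rate_nonclicked=0.1):
    """Keep all clicked examples, keep a fraction of non-clicked ones,
    and return inverse-propensity weights so the loss stays unbiased.
    (Rates here are illustrative, not from the paper.)"""
    clicked = np.asarray(clicked, dtype=bool)
    keep = clicked | (rng.random(clicked.shape) < keep_rate_nonclicked)
    # Clicked rows were kept with probability 1, non-clicked with keep_rate_nonclicked.
    weights = np.where(clicked, 1.0, 1.0 / keep_rate_nonclicked)
    return keep, weights[keep]

labels = rng.random(1_000_000) < 0.02      # ~2% click rate, synthetic data
mask, w = sample_for_training(labels)
print(mask.mean(), w.sum() / len(labels))  # ~12% of rows kept, total weight mass ≈ 1
```
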
Improving accuracy

The paper also discusses techniques aimed at improving accuracy.

  • Loss engineering. Improving log loss doesn’t always improve key business metrics in a production setting. The Google team put a ton of effort into designing auxiliary loss functions to better align online and offline metrics. Some ideas they employed include:
    • Ranknet loss, which aims to make sure the candidate ads are ranked properly relative to one another.
    • Distillation, which trains a smaller “student” model to match the predictions of a larger “teacher” model (sketched in code after this list). A surprising discovery in modern deep learning is that knowledge distillation often yields a student that is more capable than the same small model trained from scratch. In Google’s case, this lets them benefit from a “teacher” model larger than would normally be computationally feasible in production.
    • Loss curriculum, which borrows from curriculum learning by gradually introducing the more complicated loss functions over the course of training.
  • Second-order optimization. The common wisdom in deep learning is that “second order methods don’t work; just use momentum-based SGD”. Not so. The Google team found that using a recent algorithm called Distributed Shampoo led to significant improvements in accuracy with only a 10% increase in training time.
  • Deep and Cross Networks. The paper outlines the use of a DCNv2 variant that uses bottlenecks to learn the effective feature crosses that are critical for recommender systems. This Deep & Cross Network sits between the embedding layer described earlier and the DNN, and it yielded a 0.18% accuracy improvement for a minimal (3%) increase in training cost (see the sketch below).
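
To ground the distillation idea, here is a hedged sketch of a distillation objective: the student is trained against a blend of the true click labels and the teacher’s (precomputed) predicted CTRs. The blend weight is a placeholder, not a value from the paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, labels, teacher_probs, alpha=0.5):
    """Blend the usual log loss against true clicks with a log loss against
    the teacher's predicted CTRs (soft targets).
    alpha is a made-up blend weight, not a value from the paper."""
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return alpha * hard + (1 - alpha) * soft
```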

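And here is a rough sketch of a DCNv2-style cross layer with the low-rank (“bottlenecked”) weight matrix. The published DCNv2 cross layer computes x_{l+1} = x_0 ⊙ (W x_l + b) + x_l; the bottleneck replaces W with a product of two thin matrices. Sizes are placeholders:

```python
import torch
import torch.nn as nn

class LowRankCrossLayer(nn.Module):
    """One DCNv2-style cross layer with a low-rank (bottlenecked) weight:
    x_{l+1} = x0 * (U @ (V @ x_l) + b) + x_l, with * element-wise."""
    def __init__(self, dim, rank):
        super().__init__()
        self.V = nn.Linear(dim, rank, bias=False)  # down-projection
        self.U = nn.Linear(rank, dim, bias=True)   # up-projection + bias
    def forward(self, x0, xl):
        return x0 * self.U(self.V(xl)) + xl

# Stack a few cross layers between the concatenated embeddings and the DNN.
dim, rank = 256, 64
cross = nn.ModuleList(LowRankCrossLayer(dim, rank) for _ in range(3))
x0 = torch.randn(8, dim)  # concatenated embeddings (synthetic)
x = x0
for layer in cross:
    x = layer(x0, x)
```
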
Increasing reproducibility

Perhaps one of the most fascinating sections of this paper is on reproducibility.

Training runs for these models are rarely reproducible due to factors like random initialization, non-determinism stemming from distributed compute, numerical errors, hardware, and more.

Irreproducibility is hard to detect in training metrics, and it can impact downstream R&D: model deployment leads to further divergence, as predictions from deployed models feed into subsequent training and research.

To combat this, Google uses a metric called Relative Prediction Difference (PD), which measures the absolute point-wise difference in predictions between a pair of models. PDs can be “as high as 20% for deep models”, and methods such as fixed initialization, regularization, dropout, and data augmentation don’t make much of a difference. Ensemble techniques help, but introduce their own forms of technical debt.
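
The paper treats PD as a relative quantity; one plausible formulation (the exact normalization is an assumption on our part) is the mean absolute difference between the two models’ predictions, normalized by their mean prediction:

```python
import numpy as np

def relative_prediction_difference(p1, p2):
    """Point-wise prediction difference between two models, expressed relative
    to the average prediction. The paper's exact normalization may differ;
    this is an illustrative reading."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return np.abs(p1 - p2).mean() / ((p1 + p2).mean() / 2)
```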

Experimentation showed that ReLUs were a contributing factor because the “gradient discontinuity at 0 induces a highly non-convex loss landscape.” Moving to the Smooth ReLU (SmeLU) activation function led to a PD less than 10%, and also improved accuracy by 0.1%.
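
SmeLU replaces ReLU’s kink at zero with a quadratic piece between −β and β, so the gradient is continuous. A small sketch of the activation:

```python
import torch

def smelu(x, beta=1.0):
    """Smooth ReLU: 0 for x <= -beta, x for x >= beta, and a quadratic joining
    piece (x + beta)^2 / (4 * beta) in between, which makes the gradient
    continuous at both joins."""
    quad = (x + beta) ** 2 / (4 * beta)
    return torch.where(x <= -beta, torch.zeros_like(x),
                       torch.where(x >= beta, x, quad))
```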

Generalizing across UI treatments

The CTR performance of an ad is affected by the UI treatment it is shown in, so it’s important to be able to tease apart the contributions of the two. To do so, the Google team replaces the single CTR model with 𝜏(𝑄·𝑈): a transfer function 𝜏 applied to the inner product of two separable models, 𝑄 and 𝑈, which output vector representations of ad Quality and of the UI respectively.
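
Loosely, this is a two-tower setup: one tower sees only Quality-related features, the other only UI features, and their outputs are combined with an inner product and passed through a transfer function. The sketch below assumes a sigmoid for 𝜏 and made-up tower shapes:

```python
import torch
import torch.nn as nn

class FactorizedCTR(nn.Module):
    """Sketch of tau(Q . U): separate towers for ad-quality and UI features,
    combined by an inner product, then a transfer function tau.
    Tower shapes and the choice of sigmoid for tau are assumptions."""
    def __init__(self, q_in, u_in, dim=64):
        super().__init__()
        self.Q = nn.Sequential(nn.Linear(q_in, 128), nn.ReLU(), nn.Linear(128, dim))
        self.U = nn.Sequential(nn.Linear(u_in, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, q_feats, u_feats):
        score = (self.Q(q_feats) * self.U(u_feats)).sum(dim=-1)  # inner product Q . U
        return torch.sigmoid(score)                              # transfer function tau
```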

The upshot

The practical advice in this paper is well worth understanding, even if you, like more or less every other machine learning team today, operate at a much smaller scale than Google’s CTR system.

You might want to read it alongside another paper we recently summarized on MLOps best practices from a wider range of companies.

Check out the paper here.