The MLOps Process

The paper defines the MLOps process as “a continual loop of (i) data collection and labeling, (ii) experimentation to improve ML performance, (iii) evaluation throughout a multi-staged deployment process, and (iv) monitoring of performance drops in production”.

These responsibilities are “staggering”, and MLOps is “widely considered to be hard” - perhaps because our current understanding is “limited to a fragmented landscape” of white papers, thought pieces, and a “cottage industry” of start-ups aiming to address MLOps issues.

This paper aims to clarify MLOps by identifying what it typically involves and where the gaps are. We have summarized the (extensive) findings below, starting with the common practices for successful ML experimentation, deployment, and sustained production performance, and ending with a summary of the MLOps pain points and anti-patterns that still need to be addressed.

The Three “Vs” of MLOps

The paper identifies three properties of the ML workflow that dictate success for an ML deployment: Velocity (prototype and iterate on ideas quickly); Validation (test changes, prune bad ideas, and proactively monitor pipelines for bugs as early as possible); and Versioning (manage multiple versions of production models and datasets for querying, debugging, and minimizing production pipeline downtime).

These properties are present - and sometimes in tension - in the common practices and pain points discussed below.

Developing models

ML engineering was found to be very experimental and iterative - it is beneficial to prototype ideas quickly and demonstrate practical benefits early. Here are some keys to successful prototyping:

  • Good Project Ideas Start With Collaborators. Ideas, such as new features, often come from domain experts or data scientists
  • Iterate on the data, not necessarily the model. Experiments that provide more data or context to the model often work better
  • Account for diminishing returns. Work on ideas with the largest gains, as these gains will diminish through the stages of deployment
  • Small changes are preferable to larger changes. Keep code changes as small as possible and support config-driven development to reduce bugs (a minimal config sketch follows this list)
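
To make “config-driven development” concrete, here is a minimal sketch (the file names, config fields, and `train` entry point are our own illustrative assumptions, not from the paper) in which a new experiment is just a new config file rather than a code change:

```python
# Hypothetical sketch of config-driven experimentation: a new idea is a new
# config file rather than a code change, which keeps diffs small and reviewable.
from dataclasses import dataclass, field

import yaml  # assumes PyYAML is available


@dataclass
class ExperimentConfig:
    model_type: str = "gradient_boosting"          # which model family to train
    features: list = field(default_factory=list)   # feature columns to use
    label: str = "converted"                       # target column
    train_window_days: int = 30                    # how much history to train on


def load_config(path: str) -> ExperimentConfig:
    """Read an experiment definition from YAML, falling back to the defaults above."""
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    return ExperimentConfig(**raw)


# A new experiment (e.g. adding a feature suggested by a domain expert) is just
# another YAML file passed to the same training entry point:
# config = load_config("experiments/add_recency_feature.yaml")
# train(config)  # `train` is a placeholder for the team's existing pipeline
```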

Evaluating and deploying models

The goal of model evaluation is to prevent bad models from making it to production without compromising velocity. Here are some keys to successful evaluation and deployment:

  • Validation datasets should be dynamic. Engineers should update validation sets systematically in response to live failure modes and to model underperformance on important user subpopulations (see the sketch after this list)
  • Validation systems should be standardized. This is difficult given the point above, and it creates tension with the pursuit of velocity
  • Spread a deployment across multiple stages. In the study, a ‘shadow stage’ was often helpful in convincing stakeholders of the benefits of a new deployment
  • ML evaluation metrics should be tied to product metrics. This should be an explicit step in an engineer’s workflow, done in alignment with other stakeholders to make sure the right metrics are chosen.
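
As a sketch of what dynamic, slice-based validation could look like (the slice names, accuracy bar, and helper function below are illustrative assumptions, not the paper’s method), a candidate model can be required to clear a per-slice bar before promotion, with a new slice added whenever a live failure mode is discovered:

```python
# Illustrative sketch: a validation suite built from named "slices"
# (important user subpopulations, past live failures). A candidate model
# must clear a per-slice accuracy bar before it can be promoted.
from typing import Callable, Dict, List, Tuple

Example = Tuple[dict, int]     # (features, true label)
Model = Callable[[dict], int]  # anything that maps features to a prediction


def evaluate_slices(model: Model,
                    slices: Dict[str, List[Example]],
                    min_accuracy: float = 0.9) -> Dict[str, float]:
    """Return per-slice accuracy, raising if any slice falls below the bar."""
    results = {}
    for name, examples in slices.items():
        correct = sum(model(x) == y for x, y in examples)
        results[name] = correct / max(len(examples), 1)
    failing = {name: acc for name, acc in results.items() if acc < min_accuracy}
    if failing:
        raise ValueError(f"Validation failed on slices: {failing}")
    return results


# When a live failure mode is discovered, it becomes a new slice, so the
# validation suite grows with production experience (hypothetical helper):
# slices["holiday_traffic_failures"] = collect_labeled_failures(...)
```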

Sustaining Model Performance

According to the paper, sustaining high performance in production pipelines is a hacky affair in many organizations. Instead, sustaining model performance should be a matter of deliberate software engineering and organizational practices. These include:

  • Create new versions: frequently label and retrain on live data. Retraining could happen every day (“you don’t really need to worry about if your model has gone stale if you’re retraining it every day”), or whenever a pre-defined pipeline performance threshold is breached
  • Maintain old versions as fallback models. This reduces downtime when a model breaks, because there is a known-good version to revert to
  • Maintain layers of heuristics. For example, you can add a heuristics layer on top of an anomaly detection model to filter surfaced anomalies based on domain experience
  • Validate data going in and out of pipelines. Continuously monitor production models with checks on expected features, their types, and how complete the data is (a minimal sketch follows this list)
  • Keep it Simple. This manifested in different ways, with some preferring simple models where possible, and others utilizing higher-capacity deep learning models to simplify their pipeline.
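
As an illustration of what lightweight data validation around a pipeline might look like (the schema, feature names, and completeness threshold here are invented for illustration), a batch can be checked for expected features, types, and completeness before the model sees it:

```python
# Minimal illustrative data check, assuming a batch arrives as a list of dicts.
# Type mismatches and badly incomplete features raise immediately; per-feature
# completeness is returned so it can also be tracked over time.
EXPECTED_SCHEMA = {            # hypothetical feature names and types
    "user_id": str,
    "session_length_s": float,
    "country": str,
}


def validate_batch(rows: list[dict], min_completeness: float = 0.95) -> dict:
    """Check that expected features exist, have the right type, and are mostly present."""
    completeness = {}
    for feature, expected_type in EXPECTED_SCHEMA.items():
        present = [r[feature] for r in rows if r.get(feature) is not None]
        bad_types = [v for v in present if not isinstance(v, expected_type)]
        if bad_types:
            raise TypeError(f"{feature}: unexpected type in {len(bad_types)} rows")
        completeness[feature] = len(present) / max(len(rows), 1)
        if completeness[feature] < min_completeness:
            raise ValueError(f"{feature}: only {completeness[feature]:.0%} of rows populated")
    return completeness
```

The same kind of check can run on a pipeline’s outputs (for example, on predicted scores) before they reach downstream consumers.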

Persistent MLOps Pain Points

The paper highlights persistent pain points - expressed as tensions and synergies between the three “Vs” covered earlier - and uses them to suggest opportunities for future tooling. The main pain points were:

  • Mismatch Between Development and Production Environments. Examples include data leakage, reliance on Jupyter notebooks, and non-standardized code quality (with production code sometimes not reviewed because ML was “experimental in nature” and reviews were a “barrier to velocity”!)
  • Handling A Spectrum of Data Errors. The spectrum ranges from hard errors (mixing or swapping columns), through soft errors (such as a few null-valued features in a data point), to drift errors. As a side note, the paper found it hard to create meaningful alerts across this spectrum, which can lead to alert fatigue for the team (see the sketch after this list)
  • Taming the Long Tail of ML Pipeline Bugs. These bugs are long-tailed, which makes them hard to write tests for, and creates a “sense of paranoia”
  • Multi-Staged Deployments Seemingly Take Forever. Multiple participants complained that the process from conceiving an idea to validating it took too long. The implication is that if bad ideas can be invalidated in earlier stages of deployment, overall velocity will increase
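
One way to cope with the data-error spectrum without drowning in alerts (a sketch under our own assumptions, not a recommendation from the paper) is to route each error class to a proportionate response, so that only hard errors stop the pipeline:

```python
# Illustrative mapping from the hard / soft / drift error spectrum to
# graduated responses, aimed at reducing alert fatigue. The severity
# classes and responses are assumptions, not taken from the paper.
from enum import Enum


class Severity(Enum):
    HARD = "hard"    # e.g. swapped or missing columns: block the pipeline
    SOFT = "soft"    # e.g. a few null-valued features: log and aggregate
    DRIFT = "drift"  # distribution shift: flag for a retraining review


def handle_data_issue(severity: Severity, detail: str) -> None:
    """Route a detected data issue to a proportionate response."""
    if severity is Severity.HARD:
        raise RuntimeError(f"Blocking data error: {detail}")
    elif severity is Severity.SOFT:
        print(f"[soft] {detail}")   # stand-in for a metrics counter, not a page
    else:
        print(f"[drift] {detail}")  # stand-in for queuing a retraining review
```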

MLOps anti-patterns

The paper highlights some MLOps anti-patterns, like:

  • Industry-Classroom Mismatch. The skills required to do MLOps effectively are learned in “the wild”, not in school
  • Keeping GPUs Warm. Sometimes, teams focus on running a lot of experiments rather than the right ones. Hyperparameter searches are often overrated.
  • Retrofitting an Explanation. Engineers sometimes “just try everything and then backfit some nice-sounding explanation for why it works”
  • Undocumented Tribal Knowledge. High-velocity experimentation makes it hard to keep documentation up to date. We’ve all seen the model everyone is afraid to touch because the developer who built it has left the company

Conclusions

This paper contains so many useful nuggets of wisdom about production ML that you should simply read it in full. Here are a few of our high-level conclusions:

  • MLOps is fragmented and changing quickly
  • Production ML is not just about ML, it’s also about the business context. For example: shadow deployment stages to build buy-in, business metrics to validate model value, retrofitted explanations to justify what works, and the continual need for velocity to demonstrate practical progress early
  • Industry approaches diverge from those grounded in academia. For example: in validation (static held-out datasets with a single metric versus dynamic updating of evaluation sets), and in dealing with distribution shifts (daily retrains on fresh data to generate new models)
  • MLOps differs in practice from software best practices. For example: code isn’t reviewed as frequently, pipelines can be undocumented, and the experimental nature of ML persists into production

This paper is a fascinating snapshot of a nascent field, and well worth reading in order to support the development of your MLOps practices.

The paper can be found here.