THE GANTRY BLOG

What can Data-Centric AI Learn from Data and ML Engineering?

Part of an ongoing series highlighting insights from papers that have contributed to the development of best practices for production ML

Josh Tobin
September 22, 2022

Normally, we think of the ML process as model-centric: we iterate on the model until it performs well on a given dataset. Data-centric AI inverts the model-centric process by bringing data into the iteration loop. We improve the quality of the dataset, which in turn translates to a better model.

Production ML Papers to Know

Welcome to Production ML Papers to Know, a series from Gantry highlighting papers we think have been important to the evolving practice of production ML.

We have covered a few papers already in our newsletter, Continual Learnings, and on Twitter. Due to the positive reception, we decided to turn these into blog posts.

What can Data-Centric AI Learn from Data and ML Engineering?

Data-centric AI only became a buzzword (ahem, area of research) after Andrew Ng formalized it in 2021, but practitioners have been focusing on data’s role in ML since the dawn of gradients. This week, we’ll explore a paper that covers five lessons the Databricks team learned building data-centric applications that they think will carry over to this new field.

Lesson 1: Data-centric AI needs to be a continuous process, not a static one

Much of the discourse around data-centric AI today centers on the process of finding a good first dataset. From balancing classes to fixing bad labels, applying data-centric practices to your training set can improve model performance.

However, once you deploy your model, the data it sees will start to change. Therefore, your mindset should be to automate processes like sampling and labeling. That way, as distributions and taxonomies change, you can re-run your data-centric workflows instead of needing to start over.
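As a rough sketch of what that looks like in practice, here is a minimal, hypothetical Python workflow where sampling and dataset assembly are encoded as re-runnable functions rather than one-off manual steps (all names here are illustrative, not from the paper):

```python
import random

def sample_for_labeling(records, k, seed=0):
    """Deterministically sample k records to send for labeling.

    Encoding sampling as code (rather than a one-off manual pull)
    means it can be re-run whenever new production data arrives.
    """
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def build_dataset(raw_records, label_fn, k=100):
    """Re-runnable workflow: sample, label, assemble.

    When the distribution or taxonomy changes, call this again on
    fresh production records instead of starting from scratch.
    """
    sampled = sample_for_labeling(raw_records, k)
    return [(record, label_fn(record)) for record in sampled]
```

Because the sampling is seeded, re-running the workflow on the same inputs reproduces the same dataset, which is what makes the process continuous rather than static.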

Lesson 2: Models are not the right "artifacts" to hand off from research to production

As ML practitioners, we’re often told to save our trained models as “artifacts” so they can be used in a production setting. This paper discourages thinking about model artifacts as the proper abstraction boundary between training and production, because:

  • Models must be retrained, so training code needs to be a production artifact too
  • Models can break because of training-serving incompatibility. This usually occurs for simple reasons, like a feature you trained on not yet being available in production. If you test the training code (not just the model) during deployment, you can catch these incompatibilities before they start to affect users
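One cheap way to catch the simple failure mode above is a pre-deployment check that every feature the model was trained on is actually present in a live serving payload. A minimal sketch (the function and feature names are hypothetical, not from the paper):

```python
def check_feature_compatibility(training_features, serving_payload):
    """Return training features that are missing (or null) in a live
    serving payload -- a common cause of silent training-serving skew."""
    return [f for f in training_features if serving_payload.get(f) is None]

# Example: "signup_channel" was used in training but isn't served yet.
missing = check_feature_compatibility(
    ["age", "country", "signup_channel"],
    {"age": 30, "country": "US"},
)
# → ["signup_channel"]
```

Running a check like this in the deployment pipeline (ideally against the training code itself, per the paper's advice) surfaces incompatibilities before they affect users.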

Lesson 3: Data monitoring must be actionable

As we’ve written about in the past, too many meaningless alerts can cause “alert fatigue”, where practitioners begin to ignore the alerting system entirely. Sadly, meaningless alerts are common in ML, so you should try to avoid them. For example:

  • Don't alert on meaningless metrics like the KL divergence between feature distributions (sounds familiar! 😉). Instead, you can do less-principled but more-actionable things like detecting that the most common value for a feature changed (e.g., from “orange” to “apple”)
  • Tie alerts to the likely effect on the next training run. The paper suggests doing so via feature importance, but approaches like predicting out-of-distribution performance could be even better in the future
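The most-common-value check from the first bullet fits in a few lines. Here is a hedged sketch (the names are illustrative; the paper doesn't prescribe an implementation):

```python
from collections import Counter

def most_common_value(values):
    """Return the single most frequent value in a window of data."""
    return Counter(values).most_common(1)[0][0]

def mode_shift_alert(reference, live, feature_name="feature"):
    """Alert when a feature's most common value changes between a
    reference window and a live window -- less principled than KL
    divergence, but far more actionable when it fires."""
    ref_mode = most_common_value(reference)
    live_mode = most_common_value(live)
    if ref_mode != live_mode:
        return (f"{feature_name}: most common value changed "
                f"from {ref_mode!r} to {live_mode!r}")
    return None  # no alert
```

An alert like "most common value changed from 'orange' to 'apple'" tells an on-call engineer exactly what to investigate, unlike a raw divergence score.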

Lesson 4: Version everything end-to-end

We’re all aware at this point that we should version models, data, and code. The authors point out that data-centric AI workflows like sampling and annotation should also be versioned.

There’s challenging technical work to be done in figuring out how to combine versions. For example, if you need to throw out your labels every time you change your annotation workflow, annotation changes will quickly become prohibitively expensive.
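One simple starting point (our sketch, not the paper's proposal) is to derive a single version id from the sampling config, the annotation config, and the data version together, so that a taxonomy change automatically produces a new, traceable dataset version instead of silently overwriting the old one:

```python
import hashlib
import json

def workflow_version(sampling_config, annotation_config, data_version):
    """Derive one version id covering the whole data-centric workflow.

    Hashing the configs alongside the data version means any change to
    sampling or annotation yields a distinct, reproducible identifier.
    """
    payload = json.dumps(
        {"sampling": sampling_config,
         "annotation": annotation_config,
         "data": data_version},
        sort_keys=True,  # canonical ordering so hashes are stable
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

This doesn't solve the hard problem of combining versions (e.g., reusing labels across annotation changes), but it makes workflow changes visible instead of invisible.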

Lesson 5: Don't assume you can just look at raw data

There's no substitute in data-centric AI for knowing your (raw) data. But due to security constraints, we don't always have that luxury in production.

Instead, the field needs to develop workarounds like looking at aggregate statistics, generating synthetic data, or spot checking a limited amount of production data.
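To illustrate the first workaround: expose only summary statistics of a sensitive feature rather than the raw records. A minimal sketch, assuming numeric values (the function name is ours, not from the paper):

```python
from statistics import mean, median

def aggregate_profile(values):
    """Summary statistics that can be inspected when raw production
    records are off-limits for security or privacy reasons."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }
```

A profile like this can still reveal drift (e.g., a shifting mean or a new maximum) without anyone ever looking at an individual production record.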

The upshot

This paper points out challenges you’ll face if you apply the current data-centric AI discourse to real-world production settings. This is a “mindset paper”: it’s worth reading if you want to expand your thinking about data-centric AI, but don’t expect to find all of the answers here.