TL;DR: As easy as it has become to train models, getting them to work well in real products with real users is still a mess. These ML-powered products need tools that help practitioners figure out how their models are performing and find ways to refine them.
We’re excited to announce that we have raised a combined $28.3M in Seed and Series A funding led by Amplify and Coatue, with participation from Index Ventures, South Park Commons, and a host of amazing angel investors like Pieter Abbeel and Greg Brockman.
The reason we’re building Gantry is that, despite the “democratization of machine learning,” it’s still too hard to build great ML-powered products. Let’s talk about why.
Machine learning is eating software, and it’s causing indigestion
If you’ve been in the industry a while, machine learning is starting to feel like “a real thing” now. In the ten years since AlexNet, the number of new ML papers doubled every 23 months, while using a state-of-the-art model went from four months of fiddling with CUDA kernels and running grad student descent to calling AutoModel.from_pretrained("latest-and-greatest"). Before, machine learning was a fringe technology used by the largest tech companies to sell ads. Now, it’s a whole industry, with billions of dollars of funding, hundreds of thousands of jobs, a share of enterprise IT budgets, and a claim to new products that couldn't have existed before. Or as Andrej Karpathy more eloquently put it, “AI is eating software.”
But if you’re an ML practitioner, you may have a less rosy view. As good as we’ve become at building models, they are only one component of a broader application built to interact with end users. And if you’ve worked on one of these ML-powered products, you’ve probably encountered some of these hard-to-swallow pills:
- I’m sorry, but your users don’t care if your model is SoTA
If you’ve spent time following ML research, you may have a strong opinion on state-of-the-art (SoTA) chasing. But in the context of ML-powered products, the relationship between better benchmark performance and a better product is even more tenuous. You might have felt this pain if you:
- Spent a month driving your accuracy up a few points, only to realize that the new model decreases user engagement.
- Deployed a model that the numbers said was much better, only for 1% of your users to hate it so much you have to roll it back.
- Did a bunch of data / ML work, only to realize your company could have fixed the problem with a small product change.
If it’s true that “what gets measured, gets managed,” then doing the ML part of an ML-powered product sometimes feels like managing a Burger King by measuring the quality of the buns.
- Let’s be honest: you probably don’t actually know if your model is working
While better model performance doesn’t always make users happier, you’re often lucky if you can measure it at all. To do so, you’d need labels, but labels are slow and expensive to collect. Instead, you’re told to measure “data drift,” so you hack together a spaghetti mess of SQL queries and Jupyter notebooks. But what does the KL-divergence of 0.37 you computed actually mean for your model? What’s worse, many model failures don’t even show up in aggregate statistics. Instead, a user silently churns, or a support ticket slowly makes its way through the organization until you’re tasked with figuring out why your model thinks “it’s cold outside” is an invitation to reply with “global warming is a hoax.”
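To make that drift number concrete, here is a minimal sketch (not from any particular tool) of how a KL divergence between a training-time and a production histogram of one feature might be computed. The bin edges, the smoothing constant, and the sample values are all illustrative assumptions.

```python
import math
from collections import Counter


def histogram(values, edges):
    """Return the fraction of values that fall in each bin defined by sorted edges."""
    counts = Counter()
    for v in values:
        # Bin index = number of edges at or below v, so there are len(edges) + 1 bins.
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    return [counts[i] / len(values) for i in range(len(edges) + 1)]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats. eps (an assumed smoothing constant) avoids dividing by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # bins that are empty in p contribute nothing
    )


# Hypothetical values of one model input, at training time and in production.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
production = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]
edges = [0.25, 0.5, 0.75]

reference_hist = histogram(training, edges)
production_hist = histogram(production, edges)
drift = kl_divergence(production_hist, reference_hist)
print(f"KL(production || training) = {drift:.3f}")
```

The result says how far the input distribution has moved, not whether predictions got worse, which is exactly why the number is hard to act on by itself.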
- End-users are the OG adversarial examples
Train and test all you want; chances are your end users are going to find edge cases that your model wasn’t designed to handle. It could be as simple as a new group of users speaking a language your model wasn’t trained for, or as complex as a small bug fix that causes users to prefer a different type of recommendation. Maybe your users happen to own chihuahuas that look a whole lot like muffins, or, if you’re (un)lucky enough to have my friends as users, they might prank your model to test its limits. Most of all, users are fickle – their behavior will change, and that can break an assumption you implicitly baked into your model.
- What do you even do when things break?
Even if you manage to regularly detect poor model performance on real-world data, then what? To fix the problem you’re stuck running SQL queries, trying to piece together new training data by hand, finding budget to label it, and then making bespoke charts to convince your team the fix actually works.
Just like military strategists say that “no plan survives contact with the enemy,” the MLE mantra might as well be “no model survives contact with the end user.” If ML is really going to eat software, we need a better way to bridge the gap between the model we trained and the one that’s out there in the wild.
Good models are trained, great models are refined
It’s possible that we have been looking at our goal as machine learning teams the wrong way. Instead of spending your time iterating offline on a static dataset to build the perfect static model, your goal should be to build a continual learning system by shipping a minimum viable model as quickly as possible and refining it in production.
That’s because, as an MLE, you’re effectively a part-time product manager. It’s your job to find ways to improve the model, and the best ideas come from production. Only there can you see how actual users react. For example, if you’re building Google Translate, it doesn’t matter if your model is 99% accurate if the 1% are so bad they cause your users to wage a Twitter war against you.
But refining models in production is scary for many ML teams. Even deploying a model you believe is great can be scary when you won’t find out that the model is misbehaving until it’s all over Twitter.
In part, this fear is because our tools and workflows evolved for the static model context. Want to test a new model architecture? Great! Your team probably has intuitive, self-service tools to do so reproducibly at scale. But if you instead want to fine-tune on, say, the predictions your users don’t like, you’re stuck asking a favor from the analytics team to run the query to get the right data, and stitching together 3 different tools to do the labeling, analysis, and retraining. Even if you succeed, you’re on the hook to maintain yet another data pipeline, and keep track of yet another set of data artifacts to reproduce the steps you took.
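As a rough illustration of the "predictions your users don’t like" query, here is a minimal sketch assuming predictions and feedback are logged separately and share an ID. The record layout and the field names (feedback_id, thumbs_up, inputs, outputs) are hypothetical, chosen only for this example.

```python
def disliked_examples(predictions, feedback):
    """Join logged predictions to user feedback and keep the ones users rejected."""
    # Index feedback by the shared ID so the join is a dictionary lookup.
    feedback_by_id = {f["feedback_id"]: f for f in feedback}
    selected = []
    for record in predictions:
        fb = feedback_by_id.get(record["feedback_id"])
        # Predictions with no feedback are skipped: we don't know how users felt.
        if fb is not None and fb["thumbs_up"] is False:
            selected.append(
                {"inputs": record["inputs"], "rejected_output": record["outputs"]}
            )
    return selected


# Hypothetical production logs.
predictions = [
    {"feedback_id": "a1", "inputs": "it's cold outside", "outputs": "reply A"},
    {"feedback_id": "b2", "inputs": "see you at noon", "outputs": "reply B"},
    {"feedback_id": "c3", "inputs": "thanks!", "outputs": "reply C"},
]
feedback = [
    {"feedback_id": "a1", "thumbs_up": False},
    {"feedback_id": "b2", "thumbs_up": True},
]

to_relabel = disliked_examples(predictions, feedback)
print(to_relabel)
```

The rejected examples still need correct labels before they are useful for fine-tuning; this only covers the selection step.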
Gantry: a continuous ML improvement platform
The process of refining models in production is as critical as building the model in the first place, so it deserves to be as easy, rigorous, and repeatable as model development has become. A solution focused on this would empower ML engineers to do the following:
- Aggregate production data from across the ML application, including model inputs and predictions, labels, and explicit and implicit feedback provided by your end-users
- Discover degradations in accuracy, underperforming cohorts, edge cases, and other opportunities to refine the model’s performance
- Gather and label the data needed to make these improvements
- Validate that the fix worked, and generate test cases to make sure it doesn’t happen again
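The "underperforming cohorts" step in the list above can be sketched in a few lines. This is a generic illustration, not Gantry’s implementation: the record fields (language, prediction, label) and the 0.1 margin are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_cohort(records, cohort_key):
    """Compute accuracy per cohort, ignoring records that have no label yet."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        if r.get("label") is None:
            continue  # unlabeled production data can't be scored
        cohort = r[cohort_key]
        totals[cohort] += 1
        correct[cohort] += int(r["prediction"] == r["label"])
    return {c: correct[c] / totals[c] for c in totals}


def underperforming(accuracies, overall, margin=0.1):
    """Return cohorts whose accuracy trails the overall accuracy by more than margin."""
    return sorted(c for c, acc in accuracies.items() if acc < overall - margin)


# Hypothetical labeled production records.
records = [
    {"language": "en", "prediction": "spam", "label": "spam"},
    {"language": "en", "prediction": "ham", "label": "ham"},
    {"language": "en", "prediction": "ham", "label": "ham"},
    {"language": "en", "prediction": "spam", "label": "spam"},
    {"language": "es", "prediction": "ham", "label": "spam"},
    {"language": "es", "prediction": "spam", "label": "ham"},
    {"language": "es", "prediction": "ham", "label": None},
]

labeled = [r for r in records if r["label"] is not None]
overall = sum(r["prediction"] == r["label"] for r in labeled) / len(labeled)
per_cohort = accuracy_by_cohort(records, "language")
flagged = underperforming(per_cohort, overall)
print(per_cohort, flagged)
```

A healthy-looking aggregate can hide a cohort that is failing completely, which is why the per-cohort breakdown matters more than the overall number.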
This is what we’re working on at Gantry.
Gantry is a continuous ML improvement platform. We help you figure out how your ML-powered product is really performing, find ways to improve it, and operationalize those improvements.
You can add Gantry to your application by adding a single log line to your production service. As you scale, you can instrument your whole ML application, including capturing labels and feedback from your users.
    prediction = model.predict(inputs)
    gantry.log_record(
        application="Model Service",
        inputs=inputs,
        outputs=prediction,
        environment="production",
        feedback_id=my_uuid,
    )
Gantry helps you monitor the performance of your models in production, but it is not just a “set it and forget it” ML monitoring library. We also provide powerful and intuitive tools to use data and human feedback to discover opportunities to improve your model.
It’s not enough to be able to occasionally find places to improve your model. Continuous refinement should be a systematic, repeatable process that is connected with the rest of your workflows and infrastructure. Our SDK provides a dataframe-like interface to access all the data and metrics we are computing, so you can easily build labeling, testing, and retraining workflows on top of your production data.
Combining all of this means that ML engineers can build systematic, collaborative, and repeatable workflows that refine and iterate on deployed models, without building and maintaining new pipelines and data infrastructure.
But we’re just getting started. We have an ambitious roadmap aimed at giving you the infrastructure and workflow tools you need to easily build continuous learning systems, so you can focus on what really matters – creating amazing ML-powered product experiences that delight your end users.