What better feeling is there as a developer than when all of your tests pass?
In traditional software engineering, we get to experience that feeling on a regular basis, especially when practicing test-driven development (TDD). In TDD, you start by writing test cases for your desired functionality. Then, you write the minimum solution that makes the test pass. Finally, you refactor the code while making sure the tests continue to pass.
In the large language model (LLM) world, TDD is far from standard practice. In fact, if you’re building with LLMs, your testing probably consists of manually trying a few inputs to the application and reviewing the outputs by hand. Automated testing requires quantifying success, and with LLMs that quantification is much harder to achieve.
As LLMs mature, we need structured approaches to develop with them as a team. In this post, we’ll explore a framework for test-driven development of LLM-powered applications. TDD for LLMs can not only speed up development and reduce errors, but also create a virtuous cycle of continuous improvement, where undesirable behaviors are detected and folded into the tests as they surface. To see how, let’s start by understanding how testing LLMs is different from testing traditional software.
Testing LLMs is harder than testing software
Testing ordinary ML models is already quite different from conventional unit testing. In ML, you evaluate your model against held-out data from your training distribution, and later against a test set drawn from the production distribution. In a typical classification task, precision, recall, and accuracy are easy to compute given ground-truth labels.
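For a conventional classifier, that evaluation is a few lines of code. Here’s a minimal sketch using scikit-learn, with made-up labels and predictions purely for illustration:

```python
# A minimal sketch of conventional ML evaluation: compare predictions
# against ground-truth labels with standard metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels (made up for illustration)
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```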
These techniques break down for LLMs. You’re probably using a pretrained LLM, so you don’t know its training distribution. And, since these models are trained on data from the entire internet, the test distribution for whatever task you’re using the model to solve is practically guaranteed to differ substantially from the training data.
Then there’s the issue that LLMs output text, which typically lacks structure without deliberate prompt engineering. Text is much harder to evaluate than the labels a traditional ML classifier would predict. Consider the following example:
label=[“photograph of a cat”]
prediction=[“this is an image of a tabby cat”]
How would you quantitatively evaluate the prediction? It’s not obvious.
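One heuristic is to compare the two strings in embedding space rather than character by character. Here’s a rough sketch using sentence-transformers; the model name and the idea of a similarity threshold are illustrative choices, not a recommendation, and embedding similarity can still miss factual errors that matter:

```python
# A sketch of scoring free-text output by embedding similarity to a reference.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

label = "photograph of a cat"
prediction = "this is an image of a tabby cat"

similarity = util.cos_sim(model.encode(label), model.encode(prediction)).item()
print(f"cosine similarity: {similarity:.2f}")
# Whether that counts as "correct" depends on a threshold you choose yourself.
```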
Evaluating a single use case is hard enough. Now try doing that across every domain your general-purpose LLM might encounter. Your LLM may be highly accurate for simple questions about dogs, but ask it about electromagnetism and it may defy the laws of physics. You might restrict your LLM-based application to a certain topic, but even then, the diversity of potential inputs and outputs is likely much larger than what other ML models would encounter.
Just because conventional methods for evaluating ML models don’t work for LLMs doesn’t mean we should give up. The heterogeneity in LLM accuracy across input types is precisely why rigorous testing matters. Without quantitative measurements, you could iterate on your application and improve it in one way while unknowingly sacrificing quality in another.
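One practical guard is to slice your metrics by input type. A minimal sketch, assuming your evaluation harness already produces per-example results tagged with a topic:

```python
# A sketch of slicing evaluation results by topic, so an improvement in one
# area can't silently hide a regression in another. The `results` records are
# an assumed format produced by your own evaluation harness.
from collections import defaultdict

results = [
    {"topic": "dogs", "passed": True},
    {"topic": "dogs", "passed": True},
    {"topic": "physics", "passed": False},
    {"topic": "physics", "passed": True},
]

by_topic = defaultdict(list)
for r in results:
    by_topic[r["topic"]].append(r["passed"])

for topic, outcomes in by_topic.items():
    print(f"{topic}: {sum(outcomes) / len(outcomes):.0%} passing ({len(outcomes)} cases)")
```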
How to test LLMs
With some creativity, and a recurring theme of using LLMs to test LLMs, it’s possible to address the challenges we’ve highlighted. There are two key questions to consider:
1. What data should we test an LLM on?
2. What metrics should we compute?
The first question doesn’t have a static answer. Rather, your evaluation dataset should grow incrementally as you develop a better understanding of your problem space. Start small with toy examples of your own to build some intuition for the LLM’s behavior. Later, after gathering feedback from users and perhaps from another LLM, you’ll augment your evaluation dataset.
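In practice, the evaluation set can start as something as simple as an append-only file of cases. A minimal sketch, with an illustrative (not standard) schema:

```python
# A sketch of a seed evaluation set stored as JSONL, so cases gathered later
# from users or another LLM can simply be appended. The fields are an
# illustrative schema, not a standard.
import json

seed_cases = [
    {"input": "What breed is a tabby?",
     "reference": "Tabby is a coat pattern, not a breed.",
     "source": "author"},
    {"input": "Summarize: the cat sat on the mat.",
     "reference": "A cat sat on a mat.",
     "source": "author"},
]

with open("eval_cases.jsonl", "a") as f:
    for case in seed_cases:
        f.write(json.dumps(case) + "\n")
```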
Ideally, you’ll have a combination of hard and diverse test cases to challenge your model. Easy or repetitive examples aren’t useful because it’s difficult to observe improvements when the model already succeeds. Recall that in TDD, the tests should fail at first so that by developing the application we can make them pass.
As for what metrics you should compute on those test cases, it depends on how well you understand what the answer should be. If you’re lucky enough to have a source of ground truth, then you can directly measure your LLM application against those labels, just like in regular ML. Otherwise, the best strategy is probably to ask another LLM to evaluate your model. We’ll cover evaluating LLMs in more depth in a future article.
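For a taste of what that looks like, here’s a rough sketch of LLM-as-judge scoring. `call_llm` is a hypothetical stand-in for whatever client you use to reach a judge model, and the rubric and 1–5 scale are illustrative choices:

```python
# A sketch of LLM-as-judge scoring. `call_llm` is a hypothetical placeholder
# for your own model client; the rubric and scale are illustrative.
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (wrong or unhelpful) to 5 (correct and complete).
Reply with only the number."""

def judge(question: str, answer: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

# e.g. scores = [judge(c["input"], my_app(c["input"]), call_llm) for c in eval_cases]
```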
Monitoring LLMs to test application behavior
So far, we’ve covered the building blocks for test-driven development for LLMs. There’s a related idea of behavior-driven development (BDD), which shifts the focus to the application’s overall behavior instead of individual test cases. Teams using BDD define what success looks like from their users’ perspective and build software to achieve that.
We can extend the principles of BDD to LLMs by monitoring end-user outcomes, and not just testing the model’s accuracy. Ultimately, we’re using LLMs to solve problems for users, so it’s worth tracking indicators that users are satisfied or otherwise achieving their goals.
Users typically aren’t super motivated to give feedback, so instituting low-friction, high-signal feedback mechanisms can make a big difference to your monitoring. In the best case, you can collect feedback directly as part of the user’s workflow—the “accept changes” pattern a lot of apps use is one example. You can still leave a space for longform feedback, and even if it’s seldom used, the responses may be especially valuable.
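Concretely, the “accept changes” signal can be captured as a structured event tied to the model output that produced it. A minimal sketch, where the event schema and log destination are illustrative assumptions:

```python
# A sketch of low-friction feedback capture: log a structured event whenever a
# user accepts or rejects a suggestion, with an optional free-text comment.
import json
import time

def record_feedback(request_id: str, accepted: bool, comment: str = "") -> None:
    event = {
        "request_id": request_id,   # ties the feedback back to a model output
        "accepted": accepted,
        "comment": comment,
        "timestamp": time.time(),
    }
    with open("feedback_log.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

# e.g. record_feedback("req-123", accepted=False, comment="missed the deadline")
```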
Feedback doesn’t mean much if you don’t act on it. If you have a large volume of feedback, it may be more practical to distill themes from it than to review every item individually, and then revise how you prompt your LLM accordingly.
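One way to do that distillation at scale is, once again, to ask an LLM. A rough sketch, reusing the hypothetical `call_llm` stand-in from earlier:

```python
# A sketch of distilling themes from a batch of feedback comments with an LLM.
# `call_llm` is a hypothetical stand-in for your own model client.
def summarize_feedback(comments: list[str], call_llm) -> str:
    prompt = (
        "Here are user feedback comments about an LLM-powered application:\n"
        + "\n".join(f"- {c}" for c in comments)
        + "\nList the three most common themes, most frequent first."
    )
    return call_llm(prompt)
```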
Putting it all together: The virtuous cycle of TDD
Let’s zoom out to the full TDD cycle.
As the application developer, you’re also the first user and are responsible for the initial, proof-of-concept test cases. Over time, you’ll dogfood your product within your team and ultimately release it to customers. At each step, user feedback is the driving force of your development flywheel.
Whether you ask for feedback explicitly or collect it through interaction data, the themes you extract from it will shape your evaluation dataset. If you add sufficiently hard and diverse examples, you’ll see your production model struggle. And because you’ll have implemented quantitative evaluation metrics, you’ll have empirical evidence that the model has room for improvement.
From there, you have a couple of levers for iterating on your application to meet the new bar set by your test cases. For one, you can fine-tune your base LLM. While the exact methods are beyond the scope of this article, one technique is to incorporate user feedback into your training data. If you’re using a closed-source LLM, fine-tuning might not be an option, although you could consider swapping out your LLM entirely.
The other way to iterate is through prompt engineering. Methods for this are largely still manual—you would simply try different prompts until your evaluation metrics improve, both on new test cases and old ones.
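Even manual prompt iteration benefits from being wired into the same evaluation loop, so every candidate prompt is scored on the same cases. A sketch, where `run_app` and `score` are hypothetical hooks into your own application and metric:

```python
# A sketch of prompt iteration as an evaluation loop: run each candidate
# prompt template over the same eval cases and compare average scores.
# `run_app` and `score` are hypothetical hooks into your app and metric.
def compare_prompts(prompt_templates: dict, eval_cases: list, run_app, score) -> dict:
    results = {}
    for name, template in prompt_templates.items():
        scores = [score(case, run_app(template, case["input"])) for case in eval_cases]
        results[name] = sum(scores) / len(scores)
    return results

# e.g. compare_prompts({"v1": PROMPT_V1, "v2": PROMPT_V2}, eval_cases, run_app, score)
```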
Between a fine-tuned LLM and improved prompts, the test cases that were once hard for your product will become easier. After that, the cycle continues as your user base grows and finds new use cases to push you to do even better.
At Gantry, we’re building the LLMOps tools to help you achieve this coveted TDD flywheel and accelerate your product development. To learn more about continuously improving your ML-powered applications, get in touch.