There’s a lot that goes into building a fully fledged eval suite for your AI features.
Compared to traditional (deterministic) software, implementing thorough test coverage for non-deterministic LLM-powered software is a massive, complex undertaking.
You’ve got to define your eval scenarios, generate synthetic input data, implement tracing, run & automate your test scenarios, conduct error analysis & annotate your traces, synthesize your annotations into failure modes, calculate inter-annotator agreement, develop the actual evals for each failure mode (which probably includes making an LLM-as-judge, which is a process in itself), calculate the uncertainty & true success rate for each of your eval metrics, then instrument automation to run the eval suite and track it over time. And then, finally, you can start making improvements and hillclimbing against your metrics.
And that’s only a high-level view of the process.
And much, much more of this process falls on the product manager. You can’t ‘just’ expect engineering to create a test suite that validates each of your acceptance criteria.
BUT not every feature is worthy of that level of investment, whether because it’s too basic, too experimental, or too reliable (what a nice problem to have!).
Going through this whole process isn’t typically the place to start, either. You probably want to get your feature into the real world faster than that. Especially when executive expectations for velocity are way higher in this new AI world.
Remember that your process should directly help you ship better working software, not be an end in itself. You need something you can put in front of users that helps solve their problems.
Before getting into what to do, remember why you’re doing this.
Spending time with your usage data is the single best way to build your intuition for the model in the context of your use case. Actually looking at your data will help you improve your prompt and narrow down the myriad ways you can iterate, giving you clarity and cutting down on time wasted optimizing things that won’t move the needle. After all, this whole article is about how doing it ALL isn’t always worth it.
It’s incredibly important that the product manager (or person wearing that hat) is the one doing this work. Delegating away this work will leave you disconnected from the product and significantly underinformed on how to move forward and build a product people actually want and use.
Reviewing this data is an efficient & easy window directly into the customer experience. You see exactly how they interact with the feature and what they get back from these probabilistic models. Plus, those inputs (for most cases anyway) will give you deep insight into user expectations, needs, and desires. And since it’s real use, it’s not biased by human nature like a customer interview, where you’re likely to be told things that just aren’t realistic.
This article is all about when NOT to build a full evals suite, but fortunately, this foundational AI evaluation work turns out to be exactly the prerequisite work for building one should you need to later!
One final thought here. This is much like how, in traditional software, you wouldn’t ship a feature without at least manually testing it and hopefully with your engineers building unit tests for the important stuff. In important or high-risk areas, you would also want thorough integration tests or end-to-end tests — but in practice, we rarely do that for every aspect of the product. In LLM-powered products, we should always be capturing traces, looking at our data, and running test scenarios. And for the important or high-risk areas, we’ll want to develop a full eval suite.
Before You Ship
Before shipping, you need to test your AI product or feature, which will generate “traces” that you need to review and annotate. Those are the foundational must-haves. Once you’ve done it, I’m sure you’ll never look back.
Before we jump into the “how”, let’s quickly cover what a trace is. A trace refers to the record of your AI feature’s full behavior for a given input. It’s a simple record of everything that happened. For a typical LLM pipeline, you’ll have a user prompt, a system prompt, and an output/result. If you’re doing tool calling, that would be logged too. It’s typically an engineering responsibility to have this up and running (perhaps yours if you’re prototyping in a no-code or low-code environment like Zapier or n8n). But the product person will need to ensure proper, thorough tracing, and (spoiler) you need to read the traces!
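To make that concrete, here’s roughly what a single trace record might contain for a simple single-turn pipeline. This is just an illustrative sketch; the field names are made up, and whatever tracing tool your team uses will have its own schema.

```python
# A minimal, illustrative trace record for a single-turn LLM call.
# Field names here are hypothetical; your tracing setup will have its own schema.
trace = {
    "trace_id": "tr_0042",
    "timestamp": "2025-01-15T10:32:00Z",
    "model": "gpt-4o-mini",                  # whichever model you're calling
    "prompt_version": "v1.2.0",              # more on versioning later in this article
    "system_prompt": "You summarize customer support tickets in 3 bullet points.",
    "user_input": "My invoice was charged twice and support hasn't replied in 4 days...",
    "tool_calls": [],                        # populated if the model invoked tools
    "output": "- Customer was double-charged on invoice #1234 ...",
    "latency_ms": 1840,
    "total_tokens": 912,
}
```

If you can pull up a record like that for any given user interaction, you have what you need for everything that follows.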
There are two ways to define your test scenarios (manually & synthetically) and two ways to test your scenarios (manually & automatically).
Your test cases need to represent realistic usage, not comprehensive coverage of every possibility. I’m going to assume you know your customers’ needs and wants (since that’s just good product management, not AI-specific), but you can also leverage subject-matter experts or even prospective users here.
If you’re prototyping an AI pipeline and want to incorporate subject-matter experts or prospective users, focus your energy on building an understanding of their use cases and expected outputs as well as seeking verbatim inputs.
1. Define Test Scenarios
You need realistic test cases that represent how your users will actually interact with the feature. This isn’t about comprehensive coverage of every possible edge case, though. That’s what a full eval suite is for. This is about making sure real use cases work as expected.
How to come up with scenarios:
What are the 2-3 things that really change how users interact with your feature? Think through the key dimensions of variation in your use case.
For example, if you’re building an AI feature that summarizes customer support tickets:
- Ticket complexity: simple question, multi-part issue, escalated complaint
- Tone: neutral, frustrated, confused
- Desired outcome: quick answer, detailed explanation, refund request, escalation to a human
You don’t need to overthink this. Just think through what actually varies in real usage. If you know your customer and their needs or jobs-to-be-done, this should be relatively straightforward.
Once you have your dimensions, come up with a few variants for each (3-5 options per dimension is usually plenty). Then create 15-20 realistic example inputs that cover different combinations of these variants.
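If you’d like a little help covering combinations without doing it all by hand, a short script can enumerate every combination and sample a handful for you to write real inputs against. This is just a sketch using the support-ticket dimensions from the example above; the variants are illustrative.

```python
import itertools
import random

# Dimensions and variants from the support-ticket example above (illustrative).
dimensions = {
    "complexity": ["simple question", "multi-part issue", "escalated complaint"],
    "tone": ["neutral", "frustrated", "confused"],
    "desired_outcome": ["quick answer", "detailed explanation", "refund request", "human escalation"],
}

# Every combination of variants (3 x 3 x 4 = 36 here), then sample ~18 to write realistic inputs for.
all_combos = [dict(zip(dimensions, values)) for values in itertools.product(*dimensions.values())]
random.seed(7)  # reproducible sample
scenarios = random.sample(all_combos, k=18)

for s in scenarios:
    print(s)  # write a realistic verbatim input for each sampled combination
```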
Where to get your actual test inputs:
- Pull verbatim examples from customer research or past support tickets
- Ask subject matter experts or prospective users for realistic examples
- Write them yourself based on your product knowledge
Keep these in a spreadsheet. You’ll use them for manual testing now, and they’ll make automation easier later.
Want to level up? You can also synthetically generate test scenarios and input data to create more comprehensive coverage with less manual work. But don’t let that stop you from starting simple.
2. Scenario Testing
Manual Scenario Testing
Don’t let automation hold you back from testing your scenarios.
It’s perfectly okay to use the feature yourself at first, and then even copy and paste directly from a spreadsheet. You want to test early in the process, and manual testing is the easiest place to start. Keeping track of your sample inputs (and expected outputs) in a spreadsheet will keep you organized and make automation much easier when you’re ready for it.
Automated Scenario Testing
If you’re in the prototyping stage, you can automate this pretty easily with a tool like n8n. Set it up to read each scenario from your spreadsheet, run it through your model, and write the results back to a spreadsheet (same one or different, whatever works for you). Then you can review all the outputs in one place.
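If you’d rather script it than wire up n8n, something like the sketch below works. It assumes a scenarios.csv with an input column and an OpenAI-style chat API via the official Python SDK; swap in whichever model, provider, and system prompt you’re actually using.

```python
import csv
from openai import OpenAI  # assumes the OpenAI Python SDK; use your provider's client

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM_PROMPT = "You summarize customer support tickets in 3 bullet points."  # your real system prompt

with open("scenarios.csv", newline="") as f:
    scenarios = list(csv.DictReader(f))  # expects at least an "input" column

for row in scenarios:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # whichever model you're testing
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["input"]},
        ],
    )
    row["output"] = response.choices[0].message.content

# Write results back so you can review and annotate everything in one place.
with open("scenario_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(scenarios[0].keys()))
    writer.writeheader()
    writer.writerows(scenarios)
```

Run it against your scenario spreadsheet and you’ve got every output in one place, ready to annotate.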
If you’re trying to test a production or staging feature, automation gets trickier. You’ll probably need to work with your engineers on a viable testing approach. Maybe they can run the scenarios locally for you, or you can figure out a way to hit your staging environment programmatically.
But honestly? If automation is going to slow you down at this stage, just stick to manual testing. The point is to look at the outputs and build intuition, not to build perfect test infrastructure.
3. Read & Annotate Synthetic Traces
Once you’ve run your test scenarios, you need to actually look at what happened. This is where most people want to skip ahead, or just as bad… blindly pass it to an LLM to evaluate and hope for the best.
Reading traces is how you build intuition for what the model is doing. You’ll spot patterns in failures, discover edge cases you didn’t anticipate, and understand where your prompt needs work.
For each of your test scenarios, review the full trace—the input, the system prompt, and the output. Then annotate it:
- Label it as SUCCESS or FAIL
- Write a quick note about what happened and why it failed (if it did)
You can do this right in your spreadsheet by adding columns for the result and your notes.
Who should do the annotating?
You, the product person, should always read through the traces yourself—especially at the start. This is how you build intuition and understand what needs to improve.
That said, if your feature requires domain expertise to evaluate correctly, you need someone with that expertise doing the annotation. Most product managers can’t judge whether a legal summary is accurate or a medical diagnosis is correct. If you’re building something where you genuinely can’t tell if the output is right, get a subject-matter expert involved. If that’s the case, just have them go through the exact same process: read the input(s), read the output(s), label it SUCCESS/FAIL, and note why it failed (if it did).
For most features, though, you can start with your own judgment. If you later need multiple people annotating and measuring agreement between them (called “inter-annotator agreement”), that’s a sign you’re ready to level up to a more formal eval process.
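For the curious, inter-annotator agreement is typically measured with a statistic like Cohen’s kappa, which corrects raw agreement for the agreement you’d expect by chance. Here’s a minimal illustrative sketch for two annotators’ SUCCESS/FAIL labels; you don’t need this until you genuinely have multiple annotators.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels on the same set of traces."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each annotator's own base rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Example: two people annotating the same 6 traces.
pm =     ["SUCCESS", "SUCCESS", "FAIL", "SUCCESS", "FAIL", "SUCCESS"]
expert = ["SUCCESS", "FAIL",    "FAIL", "SUCCESS", "FAIL", "SUCCESS"]
print(round(cohens_kappa(pm, expert), 2))  # ~0.67 here; closer to 1.0 means stronger agreement
```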
4. Iterate & Improve
Now you actually fix things. Based on the patterns you saw in your annotated traces, you’ll iterate on your prompt and context until you’re happy with the success rate. When solving for a particular failure, you can keep re-running the scenario until you succeed. Ideally, once you’ve solved the failure mode, you’ll want to rerun all your test cases to check you didn’t break something (known as a “regression”).
What to adjust:
- Prompt engineering: Look at each failure and think about what instruction would have prevented it. Adjust both your user prompt template and system prompt based on what you’re seeing.
- Context engineering: This is everything you’re providing to the model beyond what the user directly inputs. Even though it all ends up in the prompt eventually, it helps to think about context separately from the instructions aspect of the prompt. What additional information does the model need to succeed? This could be relevant background information, examples, or anti-examples. If you’re doing RAG, this is where you pay special attention to your retrieval to ensure you’re passing the right chunks to the model.
- Model selection & parameters: You can experiment with different models here, but don’t jump between models once you’ve made your initial selection (based on your test case results and product requirements). Prompt and context changes usually make more difference. Parameters like temperature are less common now anyway, and the other parameters rarely make or break a feature.
- Product & UX adjustments: Don’t forget you can solve problems outside the LLM layer. Sometimes the right fix is clearer copy, a different UI affordance, or setting better user expectations about what the feature can do.
Keep cycling through this process—adjust, test, annotate, adjust again—until your success rate is where you need it to be.
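One simple way to keep yourself honest about regressions is to recompute your overall success rate across all annotated test cases after each prompt change, not just the cases you were trying to fix. Here’s a small sketch, assuming your annotated results live in a CSV with prompt_version and label columns (those column names are just an example).

```python
import csv
from collections import defaultdict

# Success rate per prompt version, assuming scenario_results.csv has
# "prompt_version" and "label" (SUCCESS/FAIL) columns from your annotations.
totals, successes = defaultdict(int), defaultdict(int)

with open("scenario_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        version = row["prompt_version"]
        totals[version] += 1
        successes[version] += row["label"].strip().upper() == "SUCCESS"

for version in sorted(totals):
    rate = successes[version] / totals[version]
    print(f"{version}: {successes[version]}/{totals[version]} ({rate:.0%})")
# If a new version's rate drops on scenarios it used to pass, you've introduced a regression.
```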
Once You Ship
The main thing that changes once you ship is that your emphasis shifts to real traces rather than the synthetic scenarios you defined earlier. The synthetic scenarios still help you test your changes, but the real results become the place you identify failures.
Make sure your tracing is set up well, and it’s easy for you to read the traces. Your engineering team (or your favorite AI assistant) should be able to help you with that. If reviewing traces feels like pulling teeth, you won’t do it — and you need to be doing it.
5. Track & Version Your System Prompt
Tracking your prompt version is critical to making traces manageable. If you don’t know which prompt was used when a trace was generated, it’s very hard to track what’s going on.
Things can get pretty sophisticated here, but a good starting point is to assign a version number or other unique identifier to each iteration of your prompt (include dynamic placeholders for injected content where applicable).
I like to borrow from semver and track changes with version numbers like v1.0.0, v1.1.0, and v2.0.0: bump the major version for significant prompt rewrites and the minor or patch versions for smaller tweaks.
What to track (and connect to every trace):
- Prompt text (both system and user prompt templates)
- Model version
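As a rough illustration, the versioned record you attach to each trace can be as small as the sketch below. The structure and field names are hypothetical; keep whatever shape fits your stack.

```python
# A hypothetical prompt-version record, referenced by every trace via its key.
PROMPT_VERSIONS = {
    "v1.2.0": {
        "system_prompt": "You summarize customer support tickets in 3 bullet points.",
        "user_prompt_template": "Summarize this ticket for a support manager:\n\n{ticket_text}",  # {ticket_text} injected at runtime
        "model": "gpt-4o-mini",
        "changelog": "v1.2.0: added instruction to flag refund requests explicitly",
    },
}
```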
6. Read & Annotate Real-World Traces
Now that you’re reading real traces, you’re going to want to annotate them again: SUCCESS/FAIL and open notes, just like before.
You might be annotating in an evals tool that’s capturing your traces. If your traces are being stored in an observability tool built for evals, it’s best to keep your annotations right alongside the traces rather than in a separate spreadsheet. But as with all this, make a cost-benefit decision that makes sense for your situation.
At this stage, you might want to capture more structured data, but that’s a question of whether you’re actually going to use it and if you’re ready to invest beyond the basics.
How many traces should you review?
This is almost like asking “how many users should I talk to?” during product discovery. It really depends. More is better, but properly reviewing 100 diverse traces is much more valuable than superficially racing through 1,000 without much thought. Generally speaking, you want to keep reading so long as you keep learning. And then you want to repeat the process often enough that you’re keeping your finger on the pulse. I’d recommend making it a weekly routine (so long as there are new traces to read).
If you’re lucky enough to be shipping to a lot of users, you might end up with too many traces to possibly read. First off, don’t look at 150 traces and decide that’s too many. This is super valuable work even if it feels mundane. It’s in the same league as talking to real users. A conversation might not feel valuable in the moment, but any product manager knows how important a conversation with a customer is. An AI product manager knows the same thing about reading traces.
If you are ending up with thousands of traces, though, you’ll want to appropriately sample which ones you review.
How to sample traces:
- Flagged traces: If you gave users the option to provide feedback (thumbs up/down, report issue, etc.), start with those! Don’t only read flagged traces though… You’ll miss patterns in what users don’t bother flagging or worse, don’t realize is incorrect.
- Random selection: Just randomly select from the bunch.
- Stratified random selection: Group your traces by some attribute (user type, prompt length, or some other relevant context) and then randomly select from each bucket (see the sketch after this list).
- Look for anomalies: Use associated metrics to find outliers (high latency, unusually long conversations, high token counts, lots of tool calls, etc.).
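Stratified sampling in particular is only a few lines of code once your traces carry the attribute you want to group by. Here’s a sketch that assumes each trace is a dict with a user_type field; both the function and the field name are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(traces, group_key, per_group=20, seed=7):
    """Randomly sample up to `per_group` traces from each bucket of `group_key`."""
    random.seed(seed)
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace.get(group_key, "unknown")].append(trace)
    sample = []
    for group, items in buckets.items():
        sample.extend(random.sample(items, k=min(per_group, len(items))))
    return sample

# e.g. review ~20 traces per user type each week (assumes traces carry a "user_type" field)
# weekly_review = stratified_sample(all_traces, group_key="user_type", per_group=20)
```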
7. Iterate & Improve
This is the same process as before you shipped (Section 4). Based on the patterns you see in your real-world traces:
- Adjust your prompts and context
- Test your changes against both your original test scenarios (to catch regressions) and the new failure cases you’ve discovered
- Annotate the results
- Ship the improvements
- Repeat
The cycle continues. The difference now is that you’re learning from real usage, which means you’re not guessing, and the failure modes you’re solving actually matter to your users. And unlike with hypothetical scenarios, it’s easy to stay invested in solving real problems for real users!
Should I Expand Into Actual Evals?
You’ll know when you’re ready to build a full eval suite. You’ll feel it.
Signs it’s time to level up:
- Volume: You’re drowning in traces. Even with good sampling strategies, you can’t keep up with the flow of real-world usage. Manual review is becoming a bottleneck for shipping improvements.
- Risk: The feature is high-stakes enough that failures have real consequences. Maybe it’s customer-facing and directly impacts revenue, or it’s making decisions that could have legal or compliance implications. You need confidence that changes won’t break things in production.
- Unpredictability: You keep seeing intermittent failures that you can’t reliably reproduce or solve. You fix something, ship it, and it seems better… but then the same failure mode pops up again in a slightly different context. You need systematic testing to know if you’ve actually solved the problem.
What to build first:
Don’t try to build evals for everything. Start with the highest-leverage cases:
- Persistent failure modes: The issues that keep coming up or appear intermittently despite your fixes. If you can’t reliably solve it with prompt tweaking alone, build an eval for it.
- Low-hanging fruit: Deterministic checks that should always pass. These are often the easiest evals to write and give you quick wins on regression testing.
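To make “deterministic checks” concrete, these are plain assertions that should pass for every output, no LLM judge required. The examples below are illustrative; which checks make sense depends entirely on what your feature promises.

```python
import json

# Illustrative deterministic checks. Each takes the raw model output
# and returns True (pass) or False (fail).

def is_valid_json(output: str) -> bool:
    """If the feature promises structured output, it must at least parse."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length_limit(output: str, max_words: int = 120) -> bool:
    """Outputs that balloon past the UI's space count as failures."""
    return len(output.split()) <= max_words

def no_placeholder_leakage(output: str) -> bool:
    """Catch unreplaced template variables leaking into the response."""
    return "{ticket_text}" not in output and "{{" not in output

checks = [within_length_limit, no_placeholder_leakage]  # add is_valid_json if you expect JSON
```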
The good news? All the work you’ve done—defining scenarios, annotating traces, identifying failure modes—sets you up perfectly for building evals. You’re not starting from scratch. You already know what matters.