Testing LLM Apps Isn’t That Different

There’s a common belief that testing LLM-based apps requires throwing out the whole testing playbook. Because outputs are non-deterministic, the thinking goes, traditional testing just doesn’t apply.
I get it. But what I’ve seen happen in practice is teams falling back on manual spot-checking and calling it done.
At one company I worked at, we were building a chatbot to calculate the cost of an education. We would run the app, look at the output, see that it looked fine, and ship it. There were no structured test cases and no history of what passed and what failed. And because nobody was sampling multiple runs, the same bugs kept coming back after being fixed. Somebody would spot-check a bug once, it would pass, it would get marked as fixed, and a day or two later it would be reopened as Still Reproducible.
Yes, an LLM won’t return the exact same string every time. But the intent of a correct answer doesn’t change. If a user asks “what’s the return policy?”, the expected result is still well-defined: mention the return window, stay accurate to the policy docs, don’t hallucinate a number. The exact phrasing will vary. That’s fine. We’re not testing phrasing. We’re testing correctness.
Expected outputs still exist. It’s just that a human or another LLM has to be the judge instead of a simple assertEqual. And because outputs are probabilistic, a single run tells you almost nothing. You need to sample multiple times and measure a pass rate.
Once you’ve internalized those two shifts, everything else is familiar territory.
Judgment Replaces String Matching
In traditional testing you write:
assert response == "Your return window is 30 days."
In LLM testing you write a rubric instead:
“Does the response correctly state the return policy without adding false information?”
The test case still has an expected result. You’ve just moved from matching strings to evaluating intent.
You can do this manually by reading the output and deciding whether it passes. Or you can feed the rubric to another LLM and let it judge. Doing it by hand is tedious, so the automated version is the logical next step. There are tools out there for that, but I haven't had the chance to play around with them yet. So far I've just been prototyping an agent orchestration framework to help me perform QA tasks, which is where I got firsthand experience with some of the challenges of testing LLM-based apps.
One caveat I found while prototyping it: for structural outputs like JSON, dates, and lists, skip the rubric entirely and use traditional assertions, because structure can be checked deterministically. If you asked for JSON, parse it and check the schema. Save the rubric for where meaning matters; use assertions for everything structural.
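A minimal sketch of that structural check, assuming the app is supposed to return JSON with a "total_cost" field (the field name and shape here are invented for illustration):

import json

def has_valid_structure(response: str) -> bool:
    # Deterministic check: the output parses as JSON and has the field we asked for.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("total_cost"), (int, float))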
For the rubric side, the code below is pseudocode, but hopefully the idea comes across more clearly with it than without:
def passes_rubric(question: str, response: str, rubric: str) -> bool:
    # Ask a second LLM to act as the judge against the rubric.
    prompt = f"""
    Question: {question}
    Response: {response}
    Rubric: {rubric}
    Does the response meet the rubric? Answer only YES or NO.
    """
    # llm() stands in for whatever client you use to call the judge model.
    return llm(prompt).strip().upper() == "YES"
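Called with the return-policy example from earlier, it would look something like this (the response string is made up for illustration):

passes_rubric(
    question="What's the return policy?",
    response="You can return items within 30 days of purchase.",
    rubric="Correctly states the return policy without adding false information.",
)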
Pass Rate Replaces Pass/Fail
A single test run is a coin flip. This is the part I learned the hard way.
I’ve been using NotebookLM to build and test my own prompts, versioning them, running test cases, checking whether each section returns the right kind of output. Not exact words, just the right class of response and the right formatting. It worked fine for ten runs. Then the eleventh example broke the formatting.
That’s not a fluke. That’s the nature of probabilistic outputs. One pass doesn’t mean it works. One fail doesn’t mean it’s broken. You need enough samples to see the real picture.
Run each test case 10-20 times. Measure the pass rate. Then decide what threshold you’re comfortable shipping at. For a customer-facing chatbot, maybe 95% isn’t good enough. For an internal tool, 80% might be fine. That’s a business decision, but you can only make it if you’re measuring.
Here’s pseudocode for the pass rate side of it:
def eval_with_pass_rate(test_case, n=20):
    # run_and_judge() runs the app on the test case once and judges the
    # output against the rubric, returning True or False for that sample.
    results = [run_and_judge(test_case) for _ in range(n)]
    pass_rate = sum(results) / n
    return pass_rate

# Example output:
# "What's the return policy?" → 18/20 passed (90%)
This also gives you something to compare against. Did the new prompt improve things? Run the eval suite before and after. Compare pass rates. That number is real signal, not just a vibe check.
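That comparison can be as simple as re-running the same suite against each prompt version and diffing the rates. A sketch, assuming each test case is a dict with a "question" key (that shape is my own invention):

# Run the suite against the old prompt, swap in the new one, run it again.
baseline = {case["question"]: eval_with_pass_rate(case) for case in test_cases}
# ... deploy the new prompt ...
candidate = {case["question"]: eval_with_pass_rate(case) for case in test_cases}

for question in baseline:
    print(f"{question}: {baseline[question]:.0%} -> {candidate[question]:.0%}")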
It’s Still Just Testing
At the end of the day, you have inputs and you have expected outputs. That hasn’t changed. You still need coverage. You still need to think through flows, edge cases, and what happens when a user does something unexpected. You still write test cases. You still ask: what should this do, and did it do it?
The mechanics of evaluation look different. Rubrics instead of assertions, pass rates instead of pass/fail. But the thinking behind them is the same thinking you’ve always done. If you’ve been testing software for any length of time, you already know how to do this. You just need those two perspective shifts.
Don’t let the “non-deterministic” label convince you that you’ve forgotten how to test apps. You haven’t.
