The launch review is on track until the compliance lead asks one question: how do you know the assistant answers from our policy documents rather than around them? You have a demo on your laptop and a good feeling built from a hundred hand-run queries, and that is the entire answer. Someone offers that the answers mostly come from the documents, someone else asks what "mostly" means as a number, and the room goes quiet. The launch slips a sprint while everyone waits for evidence nobody thought to collect. The fix turns out not to be a quarter of engineering work but twenty test cases, a judge prompt, and an afternoon of labeling, and this chapter walks you through building exactly that.
What a grounded answer is
So far, this part has built the machinery: retrieval, the search stack, memory, budgets, and governed sources. This chapter proves it works, because a pipeline that fetches the right passage can still produce an answer the passage does not support.
Grounding is the property you are checking: an answer is grounded when every load-bearing claim in it traces back to a retrieved source you trust. A load-bearing claim is a statement the user would act on: the price, the policy rule, the deadline, the dosage. Connective text ("here is what the policy covers") carries no load and needs no source. Grounding needs its own eval because the model produces fluent text either way: handed a passage that half-answers the question, it returns a complete answer at full confidence, the missing half filled in from its training data, and nothing in the output marks where the passage ended and the filler began. "Why hallucinations are context failures, not model failures" explained why that happens; this chapter measures how often it does.
An answer is grounded when a checker can trace every load-bearing claim to a source you trust, and a claim nobody can trace is a claim nobody checked.
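The definition can be written down as a small data structure. The sketch below is one hypothetical way to do it: the `Claim` type, its field names, and the refund-policy sentences are all invented for illustration, and deciding which claims are load-bearing is still a judgment the rubric has to spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    load_bearing: bool               # would the user act on this statement?
    supporting_quote: Optional[str]  # line from a retrieved passage, or None


def is_grounded(claims: list[Claim]) -> bool:
    """An answer is grounded when no load-bearing claim lacks a supporting quote."""
    return all(c.supporting_quote is not None for c in claims if c.load_bearing)


# Invented example answer, split into claims.
answer = [
    Claim("Here is what the policy covers.", False, None),
    Claim("Refunds are available within 30 days.", True,
          "Refunds may be requested within 30 days of purchase."),
    Claim("Refunds are paid within 5 business days.", True, None),
]

print(is_grounded(answer))      # False: the last claim has no source
print(is_grounded(answer[:2]))  # True: connective text needs no source
```

Note that the connective first sentence carries no quote and does not count against the answer; only the unsourced load-bearing claim does.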
The two-layer eval: check retrieval, then check the answer
A single "was the answer right" score is hard to act on, because a wrong answer has two causes with two fixes: either the right passage never came back, a retrieval problem, or it came back and the answer contradicted or outran it, a generation problem. So the eval runs in two layers, each with its own labeled data and its own score.
- Layer one, retrieval checks: did the right passage come back at all? Build a labeled set of real user questions, each paired with the passage in your document set that answers it. Run every question through your retrieval step and record whether the labeled passage appeared in the results. The score is the hit rate, the share of questions where it did. This layer needs no judge at all, which makes it cheap enough to rerun on every change to the search stack.
- Layer two, answer checks: does the answer match what the passage says? Collect real answers together with the passages retrieval handed the model, then use a judge model to verify each one claim by claim, quoting the line in the retrieved passages that supports each claim or marking it unsupported. The score is the grounded rate, the share of answers with no unsupported load-bearing claim. A judge is a model grading a model, so it inherits the discipline from "Graders: deterministic, judges, and humans", namely a written rubric, a required quote behind every verdict, and a human spot-audit of its calls, because a judge nobody audits drifts just like the product does.
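Layer one is simple enough to sketch in a few lines. The version below assumes each passage has a stable identifier and that the retrieval step returns a ranked list of those identifiers; `toy_retrieve`, the questions, and the passage ids are hypothetical stand-ins for your own search stack and documents, and the cutoff `k` is an assumption you would set to however many passages your product actually hands the model.

```python
from typing import Callable


def hit_rate(labeled: list[tuple[str, str]],
             retrieve: Callable[[str], list[str]],
             k: int = 5) -> float:
    """Share of labeled questions whose labeled passage appears in the top k results."""
    hits = sum(1 for question, passage_id in labeled
               if passage_id in retrieve(question)[:k])
    return hits / len(labeled)


def toy_retrieve(question: str) -> list[str]:
    """Hypothetical retriever: canned results standing in for a real search stack."""
    canned = {
        "How long do I have to request a refund?": ["refunds-01", "billing-07"],
        "Can I expense a home office chair?": ["travel-02", "expenses-04"],
        "Who approves overtime?": ["holidays-03"],
        "What is the password rotation period?": ["security-09"],
    }
    return canned.get(question, [])


# Each entry pairs a real user question with the id of the passage that answers it.
labeled_set = [
    ("How long do I have to request a refund?", "refunds-01"),
    ("Can I expense a home office chair?", "expenses-04"),
    ("Who approves overtime?", "overtime-01"),
    ("What is the password rotation period?", "security-09"),
]

rate = hit_rate(labeled_set, toy_retrieve)
print(rate)  # 0.75: the overtime question never gets its labeled passage back
```

Because nothing here calls a model, the check can run on every change to the search stack.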
Kept separate, the layers also tell you which fix to fund: a low hit rate calls for search-stack work before anyone touches a prompt, while a high hit rate with a low grounded rate means the facts arrive and the answer departs from them, which is prompt and instruction work. A blended score hides that diagnosis.
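The diagnosis reads directly off the two numbers. A minimal sketch, assuming bars of 0.90 for both rates purely as placeholders (the chapter does not prescribe values) and a hypothetical `diagnose` helper:

```python
def diagnose(hit_rate: float, grounded_rate: float,
             hit_bar: float = 0.90, grounded_bar: float = 0.90) -> str:
    """Map the two layer scores to the kind of work to fund. Bars are placeholders."""
    if hit_rate < hit_bar:
        # The right passage is not coming back, so prompt work would be wasted.
        return "retrieval: fund search-stack work first"
    if grounded_rate < grounded_bar:
        # The facts arrive and the answer departs from them.
        return "generation: fund prompt and instruction work"
    return "both layers at or above the bar"


print(diagnose(0.70, 0.95))  # retrieval: fund search-stack work first
print(diagnose(0.95, 0.70))  # generation: fund prompt and instruction work
print(diagnose(0.95, 0.95))  # both layers at or above the bar
```

The retrieval check comes first on purpose: a low hit rate calls for search-stack work before anyone touches a prompt.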
Gate on two numbers, not the ranking scores
Retrieval research offers a menu of scores with names like MRR, NDCG, and precision at k, all measures of where in the results list the right passage landed. They earn their keep while an engineer tunes the search stack, and they fail as release gates, because each can improve while users keep receiving the same wrong answers. Similarity scoring is worse: an answer stating the opposite of its source shares nearly all its words with it and scores as highly similar.
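The similarity problem is easy to reproduce. The sketch below uses plain word overlap (Jaccard similarity) as a simplified stand-in for similarity scoring; real systems often use embedding-based scores, which differ in detail, and the policy sentence is invented for illustration.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets: shared words / all words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


source = "Refunds are available after 30 days for annual plans"
answer = "Refunds are not available after 30 days for annual plans"

print(word_overlap(source, answer))  # 0.9
```

One added word reverses the meaning, yet nine of the ten distinct words are shared, so a similarity gate would wave the contradiction through.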
A release gate needs numbers that describe what a user experiences, and the two layers already produced them:
- Retrieval hit rate: how often the right passage comes back for a question you labeled.
- Grounded-answer rate: how often an answer contains no load-bearing claim the judge could not trace to a source.
Wire both into your release gate the way "The quality bar: decide what good means" wires any quality dimension: a number, a bar, and an agreed rule for changes that land below it. The question that stalled the launch review then has a standing answer, read off a dashboard rather than argued from a demo.
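A gate of this shape, a number, a bar, and a rule, fits in a short script. Everything named here is an assumption for illustration: the `Gate` type, the metric names, the bar values, and the wording of the rules are placeholders for whatever your team agrees on.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    metric: str
    bar: float
    rule_below_bar: str  # the agreed rule, written in one sentence


def check_release(scores: dict[str, float], gates: list[Gate]) -> list[str]:
    """Return one message per gate that the current scores fail; empty means ship."""
    return [
        f"{g.metric} {scores[g.metric]:.2f} < {g.bar:.2f}: {g.rule_below_bar}"
        for g in gates
        if scores[g.metric] < g.bar
    ]


# Placeholder bars and rules; the real values are a team decision.
gates = [
    Gate("retrieval_hit_rate", 0.90, "block the release until search is fixed"),
    Gate("grounded_answer_rate", 0.85, "block the release unless the owner signs off"),
]
scores = {"retrieval_hit_rate": 0.92, "grounded_answer_rate": 0.81}

failures = check_release(scores, gates)
for line in failures:
    print(line)
# grounded_answer_rate 0.81 < 0.85: block the release unless the owner signs off
```

In a build pipeline, a non-empty `failures` list would typically be turned into a failing exit status so the rule is enforced rather than remembered.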
Show the sources in the interface
Provenance, meaning the display of which passages an answer drew from, is a trust feature: an answer with checkable citations invites verification instead of blind acceptance, which is why research products such as Perplexity attach numbered citations to every answer. It is also a free auditing channel: users click the citation that looks wrong, so every flag and complaint about a source is a grounding check you did not fund. The channel only works if each displayed source supports the claim it sits next to, so put citation-claim match in the judge's rubric. Route the reports to whoever owns the eval, in the spirit of "Metacognition: help people catch wrong answers".
Honest display beats confident display: an interface that admits it found nothing solid earns more trust than one that answers anyway in the same assured layout.
In practice, when retrieval comes back empty or weak, the product says so and stops rather than presenting an unsourced answer with the same typography as a sourced one. The refusal costs you one interaction; the unsourced answer risks a public screenshot.
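One way to make the refusal structural is to decide the response kind before any text is generated. This is a sketch under stated assumptions: the `Passage` type, a relevance score in the range 0 to 1, the 0.5 threshold, and the `generate` callable are all hypothetical, and your retriever may expose scores on a different scale.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Passage:
    source_id: str
    text: str
    score: float  # retriever relevance score, assumed to lie in [0, 1]


NO_SOURCE_MESSAGE = "I could not find this in the policy documents, so I will not guess."


def respond(question: str, passages: list[Passage],
            generate: Callable[[str, list[Passage]], str],
            min_score: float = 0.5) -> dict:
    """Answer only from passages above the threshold; otherwise say so and stop."""
    solid = [p for p in passages if p.score >= min_score]
    if not solid:
        # Empty or weak retrieval: no model call, and a distinct kind for the UI.
        return {"kind": "no_source", "text": NO_SOURCE_MESSAGE, "citations": []}
    return {
        "kind": "sourced",
        "text": generate(question, solid),
        "citations": [p.source_id for p in solid],
    }


def fake_generate(question: str, passages: list[Passage]) -> str:
    """Stand-in for the model call."""
    return f"Answer drawn from {len(passages)} passage(s)."


weak = [Passage("faq-12", "Unrelated text.", 0.21)]
strong = [Passage("refunds-01", "Refunds may be requested within 30 days.", 0.88)]

print(respond("What is the refund window?", weak, fake_generate)["kind"])    # no_source
print(respond("What is the refund window?", strong, fake_generate)["kind"])  # sourced
```

The separate `kind` field is the design point: it lets the interface render an unsourced response differently instead of reusing the layout of a sourced answer.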
Grounding decays, so sample production weekly
A passing eval proves launch day, and launch day only. After it, the documents get rewritten, the index refreshes on its own schedule or quietly stops, and the mix of questions moves toward topics your labeled set never sampled, all of which move your two numbers without any code changing. The cheap protection is a weekly sample: pull a few dozen live question-and-answer pairs, run them through the same judge, rerun the labeled retrieval set against the current index, and chart the two rates next to the launch bar, as "Production signals: evals after the ship" teaches for every quality dimension. A falling hit rate on a recently rewritten policy is a staleness alarm as much as an eval result, and it feeds back into the rules from "Freshness and conflicts: govern the knowledge you answer from".
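The weekly routine for the judged half of the check can be sketched as follows. The judge here is a stub that only looks for a citation marker, standing in for the real judge model; the sample size of 30, the bar of 0.90, the synthetic traffic, and the weekly history are all invented for illustration.

```python
import random
from typing import Callable


def weekly_sample(live_pairs: list[tuple[str, str]], n: int = 30, seed=None):
    """Draw up to n (question, answer) pairs uniformly at random from live traffic."""
    rng = random.Random(seed)
    return rng.sample(live_pairs, min(n, len(live_pairs)))


def judged_rate(sample, judge: Callable[[str, str], bool]) -> float:
    """Share of sampled answers the judge passes as grounded."""
    verdicts = [judge(question, answer) for question, answer in sample]
    return sum(verdicts) / len(verdicts)


def weeks_below_bar(history: list[tuple[str, float]], bar: float) -> list[str]:
    """Names of the weeks whose rate fell under the launch bar."""
    return [week for week, rate in history if rate < bar]


def stub_judge(question: str, answer: str) -> bool:
    """Stand-in for the real judge model: passes answers that carry a citation marker."""
    return "[source:" in answer


# Synthetic traffic: one answer in five has no citation.
live_pairs = [
    (f"question {i}", "answer [source: policy-01]" if i % 5 else "answer with no citation")
    for i in range(200)
]

sample = weekly_sample(live_pairs, n=30, seed=7)
print(len(sample), judged_rate(sample, stub_judge))

history = [("week 1", 0.93), ("week 2", 0.91), ("week 3", 0.84)]
print(weeks_below_bar(history, bar=0.90))  # ['week 3']
```

Using the same judge and the same bar as at launch is what makes the weekly points comparable; changing either one mid-series breaks the chart.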
Try it now
Build the twenty-case grounding eval against your own document set. Budget about an hour with the tools you already run.
Get your questions. Pull ten real questions users ask of your document set, from tickets, chat logs, or search queries. For each one, find the passage that answers it and record the pair. This is your labeled retrieval set.
Run the retrieval check. Send the ten questions through your retrieval step and record, for each, whether the labeled passage came back in the results. Write down the hit rate.
Collect ten answers. Ask your product ten more real questions and capture, for each, the full answer and the passages retrieval handed the model.
Write the judge prompt. Instruct a strong model to split each answer into factual claims and, for each, quote the supporting line from the retrieved passages or mark it unsupported. Score each answer pass or fail on zero unsupported load-bearing claims, and write down the grounded rate.
Spot-check five. Read five of the judge's verdicts against the sources yourself. If you overrule more than one, tighten the rubric and rerun before trusting any number it produced.
Set the gate. Put the two rates side by side and pick the bar you would genuinely hold a release to, with the rule for what happens below it written in one sentence.
Scale it down: eight cases, four per layer, and two spot-checks give you the same two numbers in half the time, and you can expand the set later.
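The judge and spot-check steps above can be sketched as a prompt template plus a scorer. Several things here are assumptions rather than part of the exercise: the prompt wording, the JSON output shape, and the extra rule that a quote must appear verbatim in a passage before it counts. The model call itself is left out, so the example scores a canned judge output.

```python
import json

# Hypothetical rubric wording; tighten it after each round of spot-checks.
JUDGE_PROMPT = """Split the answer into factual claims.
For each claim, decide whether a user would act on it (load-bearing).
Quote the exact line from the passages that supports it, or use null if none does.
Return JSON only: a list of objects with keys "claim", "load_bearing", "quote".

Passages:
{passages}

Answer:
{answer}
"""


def build_prompt(answer: str, passages: list[str]) -> str:
    return JUDGE_PROMPT.format(passages="\n---\n".join(passages), answer=answer)


def answer_passes(judge_json: str, passages: list[str]) -> bool:
    """Pass only if every load-bearing claim has a quote found verbatim in a passage."""
    for claim in json.loads(judge_json):
        if not claim["load_bearing"]:
            continue
        quote = claim["quote"]
        # Assumption: a quote the judge made up counts the same as no quote.
        if quote is None or not any(quote in p for p in passages):
            return False
    return True


def rubric_needs_work(overruled: int, max_overrules: int = 1) -> bool:
    """Spot-check rule: overruling the judge more than once means tighten and rerun."""
    return overruled > max_overrules


passages = ["Refunds may be requested within 30 days of purchase."]
canned_output = json.dumps([
    {"claim": "Refunds are available within 30 days.", "load_bearing": True,
     "quote": "Refunds may be requested within 30 days of purchase."},
    {"claim": "Refunds are paid within 5 business days.", "load_bearing": True,
     "quote": None},
])

results = [answer_passes(canned_output, passages)]
print(sum(results) / len(results))  # 0.0: the one answer has an unsupported claim
print(rubric_needs_work(overruled=2))  # True
```

Checking the quote against the passages in code is a cheap guard on the judge itself: it catches a fabricated quote before a human ever reads the verdict.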
Chapter Summary
- An answer is grounded when every claim the user would act on traces back to a retrieved source you trust; a fluent answer with an untraceable claim is not grounded, just confident.
- The model produces assured text whether or not the passages support it, so grounding has to be measured, never assumed from a demo.
- The eval runs in two layers: a retrieval check (hit rate against a labeled question-and-passage set) and an answer check (a judge model tracing claims to sources, spot-audited by a human).
- Two separate scores tell you which fix to fund: a low hit rate is search work, and a high hit rate with a low grounded rate is prompt and instruction work.
- Gate releases on the hit rate and the grounded-answer rate; ranking and similarity scores are tuning tools for engineers, not release evidence.
- Show sources in the interface, and when retrieval finds nothing solid, say so instead of answering anyway; users clicking a wrong-looking citation are audits you did not pay for.
- A displayed source must support the claim it sits next to, so put citation-claim match in the judge's rubric.
- Grounding decays after launch as documents, indexes, and question mixes move, so run a weekly judged sample of live traffic against the launch bar.
- Every decision this part made, from sources to memory to this pass bar, is ready to go on one page; "Write your Knowledge Charter and ship a product that knows its facts" walks you through writing it.