AI-Native PM

The search stack: match the retrieval method to the question

Your legal team gets an internal contracts assistant, and one afternoon someone asks it the most ordinary question in the building: which agreements expire in March. The assistant searches its embedding index and returns three contracts that discuss termination, renewal windows, and end-of-term obligations, and none of the three expires in March. One agreement that does expire that month never appears at all, because its language reads like boilerplate and lands nowhere near the question in the index. The person asking opens the contract database by hand, finds four March expirations in under a minute, and stops trusting the assistant. Every contract in that system carries an expiration date in a structured field, so the right answer was one SQL query away the whole time.

In "Retrieval: let the product look things up before it answers," you gave the product a lookup step so its answers could stand on records instead of training data. That chapter treated the lookup as a single box. There are several ways to look something up, and each one wins a different kind of question; the contracts assistant failed because it brought a similarity search to a database question. This chapter gives you the menu and a rule for choosing from it.

The four retrieval methods and the question each one wins

Every question your product fields can be served by one of four methods, and each has a home turf.

  • Keyword search matches exact words. The index records which words appear in which documents, and the search returns the documents containing the question's words, weighted so rare terms count more than common ones. It wins on names, product codes, error strings, invoice numbers, and any question where the asker knows the exact term, such as TLS_HANDSHAKE_TIMEOUT or "clause 14.2". It fails when people paraphrase, because "refund after cancellation" will never match a document that only says "post-termination reimbursement".
  • Vector search matches meanings. An embedding turns the meaning of a passage into a list of numbers, placed so passages with similar meanings land near each other; the search embeds the question the same way and returns the nearest passages. It wins the paraphrase and concept questions keyword search loses, and it loses the exact terms keyword search wins, because a part number or an error string carries almost no meaning to embed, and two unrelated codes can land side by side.
  • Hybrid search runs both and merges the two lists. Real questions mix the forms (an exact product name next to a paraphrased concept in one sentence), so running both searches and merging the results is the sane default for any knowledge base made of prose.
  • Structured lookup runs a real query. When the question is about fields (dates, owners, amounts, or statuses), the answer lives in a database or a graph, and the right retrieval is SQL or a graph query rather than any search over prose. "Which agreements expire in March" is a filter on a date column, and similarity has nothing to offer it, which is exactly how the contracts assistant failed.
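To make the structured lookup concrete, here is a minimal sketch of the March question as a date filter, using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical, and it assumes expiration dates are stored as ISO strings (YYYY-MM-DD), which sort correctly as text.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "id INTEGER PRIMARY KEY, counterparty TEXT, expires_on TEXT)"
)
conn.executemany(
    "INSERT INTO contracts (counterparty, expires_on) VALUES (?, ?)",
    [
        ("Acme Logistics", "2025-03-14"),
        ("Birch Analytics", "2025-03-31"),
        ("Cobalt Hosting", "2025-04-01"),
        ("Dune Staffing", "2025-02-28"),
    ],
)


def expiring_between(conn, start, end_exclusive):
    """Return (counterparty, expires_on) rows with start <= expires_on < end."""
    rows = conn.execute(
        "SELECT counterparty, expires_on FROM contracts "
        "WHERE expires_on >= ? AND expires_on < ? "
        "ORDER BY expires_on",
        (start, end_exclusive),
    )
    return rows.fetchall()


# "Which agreements expire in March" becomes a half-open date range.
march = expiring_between(conn, "2025-03-01", "2025-04-01")
print(march)
```

The query never reads the contract text, so boilerplate wording cannot hide a contract from it: the answer is exact because the filter runs on the field.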

Rerankers: pay for a second pass only when it earns its keep

The first search pass is built to be fast and to cast a wide net, so its ranking is rough. A reranker is a second, more careful pass: the fast search returns a few dozen candidates, and a slower model scores each candidate against the question and reorders them so the best ones sit on top. The first pass works to avoid missing the answer; the reranker works to put the answer first.

That second pass costs a beat of latency and a model call on every query, so it earns its keep in specific conditions: when only the top three to five passages get handed to the model, when the first pass returns plausible near-misses (support macros that almost apply, near-duplicate policy versions), and when answer quality is worth a slightly slower response. It earns nothing on structured lookups, since a query over fields is already exact, and little when the first pass already puts the right passage on top.
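The two-pass shape can be sketched in a few lines. This is a toy under stated assumptions, not a production reranker: the first pass counts shared words, and `careful_score` is a hypothetical stand-in for the slower model, here approximated by counting question word pairs that appear in order in the passage. The sample passages are invented.

```python
import re


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def first_pass(question, passages, k=30):
    """Fast, wide net: keep passages sharing any word with the question."""
    q = set(tokens(question))
    scored = [(len(q & set(tokens(p))), p) for p in passages]
    hits = [sp for sp in scored if sp[0] > 0]
    hits.sort(key=lambda sp: sp[0], reverse=True)  # rough ranking only
    return [p for _, p in hits[:k]]


def careful_score(question, passage):
    """Stand-in for a reranking model: count question word pairs
    that appear as adjacent words in the passage."""
    q = tokens(question)
    p = tokens(passage)
    passage_pairs = set(zip(p, p[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in passage_pairs)


def rerank(question, candidates, top_n=3):
    """Second pass: score every candidate and keep the best few."""
    ordered = sorted(
        candidates, key=lambda p: careful_score(question, p), reverse=True
    )
    return ordered[:top_n]


passages = [
    "Cancellation policy: after you contact support, a refund request "
    "form is available in the portal after login.",
    "You are owed a refund after cancellation within 30 days.",
    "Shipping times vary by region.",
]
question = "refund after cancellation"
candidates = first_pass(question, passages)
best = rerank(question, candidates, top_n=1)
print(best[0])
```

Both surviving passages share the same three words with the question, so the first pass cannot separate them; the second pass puts the passage that actually answers the question on top.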

Classify the question before you choose the index

The way out of the contracts failure is a habit rather than a technology: before you pick an index, classify the question it has to answer.

The question's form picks the index: exact words want keyword search, meanings want embeddings, fields want a real query, and mixed prose wants both.

The classification happens twice. At design time, you pull a sample of the real questions your product will field and sort each one into exact term, concept, field, or relationship, and the mix tells you which indexes to build at all; plenty of teams discover that most of their traffic is field questions and the prose corpus is a sideshow. At query time, the product routes each incoming question the same way, with a rules pass or a small, cheap classification call that sends field questions to the database and prose questions to hybrid search. Relationship questions, the kind that chain entities together ("which vendors subcontract to vendors we already audited"), want a graph query, though when they are rare a couple of SQL joins usually cover them.
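A query-time rules pass might look something like the sketch below. The hint patterns and the three destination names are hypothetical; a real product would derive its patterns from its own question sample, and a small classification call could replace the regular expressions entirely.

```python
import re

# Hypothetical hint words, for illustration only.
FIELD = re.compile(
    r"\b(expire[sd]?|expiring|due|owner|owned|amount|status|total)\b",
    re.IGNORECASE,
)
RELATIONSHIP = re.compile(
    r"\b(subcontract\w*|reports? to|depends? on|linked to)\b",
    re.IGNORECASE,
)


def route(question):
    """Send a question to the retrieval method its form calls for."""
    if RELATIONSHIP.search(question):
        return "graph_query"
    if FIELD.search(question):
        return "structured_lookup"
    # Exact terms and concepts both land here: hybrid search covers prose.
    return "hybrid_search"


for q in [
    "Which agreements expire in March?",
    "which vendors subcontract to vendors we already audited",
    "How do refunds work after cancellation?",
    "What does TLS_HANDSHAKE_TIMEOUT mean?",
]:
    print(route(q), "<-", q)
```

Checking relationship hints before field hints is a design choice of this sketch: a chained question often mentions a field too, and the graph route is the narrower one.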

Chunking: what you index as one unit decides what can be found

Chunking is the decision of how to split documents into the units the index stores and returns, and it reads like a plumbing detail while behaving like a product decision.

What you index as one unit is what you can retrieve as one unit, so chunk size is a decision about what a good answer needs to quote.

The trade runs in both directions:

  • Chunk too small and the product returns the one clause that answers the question while the definitions section that gives the clause its meaning stays behind.
  • Chunk too large and every question drags in dozens of pages, most of them noise, spending the window you will learn to budget in "Context budgets: fit the right facts into a finite window."

The workable range runs from a paragraph to a few pages, sized by what a complete answer needs to quote, usually with a line or two of neighboring context attached so a retrieved unit still reads on its own. Decide it by reading answers rather than tuning parameters: take your most common questions, look at what the product retrieved for each, and ask whether the unit on screen was enough to answer from. The current tooling for chunk sizes, overlap, and attaching context moves too fast for chapter prose, so the specifics live in the dated Retrieval Stack Sheet.
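As one illustration of attaching neighboring context, the sketch below treats each paragraph as a unit and carries the last sentence of the previous paragraph along with it. The function name, the dictionary keys, and the sample text are hypothetical, and the sentence splitter is deliberately naive; current tooling may do this quite differently.

```python
import re


def chunk_paragraphs(document, context_sentences=1):
    """Split on blank lines; attach trailing sentence(s) of the
    previous paragraph so each unit still reads on its own."""
    paragraphs = [
        p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()
    ]
    chunks = []
    for i, para in enumerate(paragraphs):
        context = ""
        if i > 0:
            # Naive split: whitespace that follows ., !, or ?
            prev = re.split(r"(?<=[.!?])\s+", paragraphs[i - 1])
            context = " ".join(prev[-context_sentences:])
        chunks.append({"index": i, "context": context, "text": para})
    return chunks


document = "\n\n".join([
    'In this agreement, "Term" means the period ending on the '
    "Expiration Date.",
    "Either party may terminate at the end of the Term with 30 days "
    "notice.",
    "Notices must be sent in writing.",
])

chunks = chunk_paragraphs(document)
for c in chunks:
    print(c["index"], "|", c["context"], "|", c["text"])
```

Here the termination clause travels with the definition it depends on, which is the too-small failure described above handled in the simplest possible way.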

Try it now

This drill takes about 30 minutes and produces the routing evidence for your own corpus.

Pick a document set you know. Choose twenty or more documents where you can judge a result instantly, such as your product docs, a contracts folder, or a support macro library.

Write ten questions, deliberately mixed. Write three exact-term questions (a code, a person's name, a phrase you know appears verbatim), three concept questions (paraphrase the content so no distinctive words are shared), two field questions (a date, an owner, an amount), and two mixed questions that put an exact term and a concept in one sentence.

Run every question through a keyword search and an embedding search. Most vector databases expose both, and some document tools do too. Scale it down: if nothing is set up, a free-tier vector database embeds a few dozen documents for pennies, and the drill needs relative rankings rather than production infrastructure.

Score each question. Note which method put the right document first, which placed it anywhere in the top five, and which missed it entirely.
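If you would rather tally the scores than eyeball them, a short script is enough. The sketch below assumes you have recorded, for each question, the correct document and the ranked results from each method; the document ids and the three sample runs are invented for illustration.

```python
from collections import Counter


def score(ranked_ids, correct_id):
    """Classify one result list as 'first', 'top5', or 'miss'."""
    if correct_id not in ranked_ids[:5]:
        return "miss"
    return "first" if ranked_ids[0] == correct_id else "top5"


def tally(runs):
    """Count outcomes per retrieval method across all questions."""
    totals = {}
    for run in runs:
        for method, ranked in run["rankings"].items():
            outcome = score(ranked, run["correct"])
            totals.setdefault(method, Counter())[outcome] += 1
    return totals


# Hypothetical drill results: one exact-term, one concept, one field.
runs = [
    {"question": "exact term", "correct": "doc7",
     "rankings": {"keyword": ["doc7", "doc2", "doc4"],
                  "vector": ["doc3", "doc9", "doc7"]}},
    {"question": "concept", "correct": "doc1",
     "rankings": {"keyword": ["doc5", "doc8"],
                  "vector": ["doc1", "doc6"]}},
    {"question": "field", "correct": "doc4",
     "rankings": {"keyword": ["doc2", "doc9"],
                  "vector": ["doc6", "doc3"]}},
]

totals = tally(runs)
for method, counts in totals.items():
    print(method, dict(counts))
```

Grouping the same tally by question type, rather than by method alone, is what shows which kind of question each method wins.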

Check the winners against the decision tree. The exact-term questions should have gone to keyword search and the concept questions to embeddings, and the field questions probably embarrassed both, which is the drill's real lesson: those two questions belonged to a structured query, and no amount of search tuning would have saved them.

Chapter Summary

  • Retrieval is not one method: keyword search matches exact words, vector search matches meanings, hybrid merges both lists, and structured lookup runs a real query over fields.
  • Keyword search wins names, codes, and error strings; embeddings win paraphrase and concept questions; each one fails where the other wins.
  • Hybrid search is the sane default for a prose knowledge base, because real questions mix exact terms and paraphrase in one sentence.
  • Any question with fields, such as dates, owners, or amounts, belongs to SQL or a graph, never to similarity search.
  • A reranker is a second, more careful pass that reorders the top results; pay for it when only a few passages reach the model and the first pass is noisy.
  • Classify the question before choosing the index: the question's form (exact term, concept, field, or relationship) picks the method.
  • Chunking is a product decision, because what you index as one unit is what you can retrieve as one unit.
  • Everything here retrieves from documents the product looks up fresh on each question; what it keeps about each user between sessions is a different kind of record, and the next chapter, "Memory: decide what your product remembers," takes that on.

Sources

  • Robertson, S. and Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
  • Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
  • Anthropic engineering blog (2024). Introducing Contextual Retrieval.
  • The Retrieval Stack Sheet for current chunking, hybrid, and reranker tooling (last verified July 2026).