# The Retrieval Stack Sheet

**Last verified: July 2026.**

We keep the chapters in this part at order-of-magnitude precision, so this sheet is the one place where the volatile names and prices live. When it disagrees with a provider's docs, the docs win.

## Embeddings and vector stores

A small team picks one embedding model and one store, and both are painful to swap later, so choose once and write the choice down.

- **OpenAI text-embedding-3-large** is the safe default with the widest integration support.
- **Voyage AI** (now part of MongoDB) is the pick when retrieval quality is your bottleneck.
- **Cohere Embed v4** is strong on multilingual work and pairs naturally with Cohere Rerank.
- **BGE-M3** is the open-weight choice when you self-host.
- **pgvector** fits when you already run Postgres and hold well under a hundred million vectors.
- **Pinecone** fits when you want a managed service and zero operations work.
- **Qdrant** and **Weaviate** fit when you want open source, metadata filtering, and native hybrid search.
- **Chroma** fits for local prototypes and evals.
- **Turbopuffer** sits on object storage and fits when the bill at scale is the problem.

## Keyword, hybrid, and rerankers

Production search stacks almost always combine lexical and vector retrieval, then rerank the merged list.

- **BM25** keyword search is still the workhorse; Elasticsearch and OpenSearch are its common homes, and most vector stores now ship it natively.
- **Hybrid fusion** means running both retrievers and merging with reciprocal rank fusion, keeping a few dozen candidates.
- **Rerankers** then reorder those candidates: Cohere Rerank and Voyage rerank-2.5 are the common managed picks, ZeroEntropy's zerank models lead some current leaderboards, and open cross-encoders like the BGE rerankers cover self-hosting.
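
The fusion step is small enough to sketch. This is a minimal version of reciprocal rank fusion, not any particular store's implementation: the function name, the document IDs, and the `top_n` cutoff are illustrative, and `k=60` is only the conventional default, so treat it as an assumption to tune.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=30):
    """Merge several ranked lists of doc IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in,
    with rank starting at 1. k=60 is a conventional default (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so output is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical results from a keyword retriever and a vector retriever.
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers return rise to the top, and only ranks are used, so the lexical and vector scores never need to be put on the same scale. The fused list is what you would hand to the reranker.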

## Agentic retrieval

Every major provider now lets the model call search as a hosted tool, which turns retrieval into a loop the model drives.

- **OpenAI** offers web search and file search tools in the Responses API; file search is a managed vector store on their side.
- **Anthropic** offers web search and web fetch tools on the Claude API, and you can return your own retrieval as cited search result blocks.
- **Google** offers Grounding with Google Search plus the File Search tool on the Gemini API, a managed RAG service with citations built in.
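
To make "a loop the model drives" concrete, here is a provider-agnostic sketch. Nothing in it is a real provider API: `model_step`, `search`, and the action dictionaries are hypothetical stand-ins for whatever tool-calling interface you use, and the toy model below is hard-coded purely so the example runs.

```python
def run_retrieval_loop(question, model_step, search, max_turns=4):
    """Let the model decide each turn whether to search again or answer.

    model_step and search are hypothetical callables supplied by the caller.
    """
    context = []
    for _ in range(max_turns):
        action = model_step(question, context)
        if action["type"] == "answer":
            return action["text"], context
        # Otherwise the model asked for a search; feed the results back in.
        context.extend(search(action["query"]))
    return None, context  # turn budget exhausted without an answer


# Toy stand-ins so the sketch is runnable.
CORPUS = {"refund policy": ["Refunds are issued within 14 days."]}


def toy_search(query):
    return CORPUS.get(query, [])


def toy_model_step(question, context):
    if not context:
        return {"type": "search", "query": "refund policy"}
    return {"type": "answer", "text": f"Based on {len(context)} passage(s): {context[0]}"}


answer, used = run_retrieval_loop(
    "How long do refunds take?", toy_model_step, toy_search
)
print(answer)
```

The `max_turns` cap is the one design choice worth keeping in any real version: without it, a model that never decides to answer will keep searching.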

## Memory tooling

Memory is retrieval pointed at the model's own past, and these are the current approaches.

- **Provider features:** Anthropic's memory tool lets Claude read and write a memory store you host; ChatGPT ships memory as a product feature rather than an API primitive.
- **Mem0** is a managed memory layer that extracts what is worth keeping from each conversation.
- **Zep** builds a temporal knowledge graph (Graphiti) for facts that change over time.
- **Letta**, descended from MemGPT, is an agent runtime where the agent manages its own memory.
- **Plain files** that the agent reads and writes remain a legitimate v1.
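
As an illustration of the plain-files option, the sketch below keeps memory as an append-only JSONL file with substring recall. The `FileMemory` class, its method names, and the record fields are all hypothetical, and substring matching is just the simplest possible lookup, not a recommendation.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class FileMemory:
    """Append-only JSONL memory (hypothetical helper, not a library API)."""

    def __init__(self, path):
        self.path = Path(path)

    def remember(self, text, tags=()):
        record = {
            "at": datetime.now(timezone.utc).isoformat(),
            "text": text,
            "tags": list(tags),
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def recall(self, query, limit=5):
        """Return up to `limit` of the most recent records containing query."""
        if not self.path.exists():
            return []
        needle = query.lower()
        with self.path.open(encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        hits = [r for r in records if needle in r["text"].lower()]
        return hits[-limit:]


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(Path(tmp) / "memory.jsonl")
    memory.remember("User prefers metric units.", tags=["preference"])
    memory.remember("Project deadline moved to Friday.")
    recalled = [r["text"] for r in memory.recall("metric")]
    missing = memory.recall("imperial")

print(recalled)
```

A file like this is easy to inspect and diff by hand, which is much of the appeal of starting here before adopting a memory layer.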

## Context features worth knowing

Three pricing levers change the economics of context. All figures here are order-of-magnitude.

- **Prompt caching:** repeated prefix tokens cost about a tenth of normal input price on current Anthropic and OpenAI models, so cache the system prompt and the documents and pay full price only for the question.
- **Long-context tiers:** standard windows sit in the hundreds of thousands of tokens; million-token tiers exist and typically bill at roughly double the standard rate once a request passes a length threshold.
- **Batch discounts:** work that can wait a day costs about half price, and the discount stacks with caching.
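
A quick back-of-envelope calculation shows how caching and batching compound. The multipliers below are the rough figures from this sheet, and the $3 per million input tokens is a made-up placeholder rather than any provider's price. The sketch also ignores output tokens and any cache-write surcharge, so read it as arithmetic, not a billing model.

```python
def input_cost(cached_tokens, fresh_tokens, price_per_mtok,
               cache_read_multiplier=0.1, batch_multiplier=1.0):
    """Approximate input cost in dollars for one request.

    cache_read_multiplier=0.1 reflects "about a tenth"; batch_multiplier=0.5
    reflects "about half price". Both are rough assumptions.
    """
    cached = cached_tokens * price_per_mtok * cache_read_multiplier
    fresh = fresh_tokens * price_per_mtok
    return (cached + fresh) * batch_multiplier / 1_000_000


PRICE = 3.0        # hypothetical dollars per million input tokens
PREFIX = 50_000    # system prompt plus documents
QUESTION = 500     # the part that changes per request

no_cache = input_cost(0, PREFIX + QUESTION, PRICE)
with_cache = input_cost(PREFIX, QUESTION, PRICE)
cache_and_batch = input_cost(PREFIX, QUESTION, PRICE, batch_multiplier=0.5)

print(f"no cache:        ${no_cache:.4f}")
print(f"cached prefix:   ${with_cache:.4f}")
print(f"cached + batch:  ${cache_and_batch:.4f}")
```

With these placeholder numbers the request drops from about 15 cents to under 2 cents with a cached prefix, and to under 1 cent when it can also wait for batch processing.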

Prices and names rot fast, so check this sheet's date and the provider's current docs before trusting any figure here.
