Field notes

Our retrieval latency budget, explained

Where the milliseconds go in a single retrieval call: embedding lookup, vector search, hybrid merge, reranker, payload hydration. P50 120ms, P95 400ms, and what we cut to get there.

A retrieval call inside the PursuitAgent drafting engine has a budget. P50 is 120ms. P95 is 400ms. If we miss either, the drafting UI feels slow, and a reviewer who feels the slowness stops trusting the citations. This post is the flame graph in prose: where the milliseconds go, what we cut, and what we still owe.

Why a budget at all

A drafted answer is a chain. Retrieval feeds the rewrite prompt; the rewrite produces a candidate sentence; the verifier entails it against the source block. If retrieval is slow, the whole chain is slow. If retrieval is unpredictable, the chain is unpredictable, and a P95 spike on retrieval becomes a P95 spike on time-to-first-token in the UI.

We treat retrieval as the load-bearing internal API. Everything downstream gets to assume retrieval is bounded.

The five stages, by milliseconds

A single retrieval call goes through five stages. The numbers below are P50 and P95 from our production telemetry. They will move when we reindex; they have moved twice this quarter already.

Stage                          P50      P95      What it does
Embedding lookup               18ms     45ms     Generate query embedding
Vector search (HNSW)           22ms     70ms     ANN search over pgvector
Hybrid merge (BM25 + vector)   14ms     35ms     Merge keyword + semantic candidates
Reranker                       48ms     180ms    Cross-encoder rerank top 50 → top 10
Payload hydration              18ms     70ms     Fetch block text, page refs, metadata
Total                          120ms    400ms

Five stages, one budget. The reranker dominates — almost always.

Stage 1 — Embedding lookup

The first thing we do with a query is generate an embedding. We use a 1536-dim model from one of the two large embedding providers; the choice is configurable at deploy time, and we’ve shipped both. A cold call to a hosted embedding API runs 60ms-plus, so we don’t take that round-trip on the hot path.

We keep an in-process LRU cache keyed on the normalized query. RFP questions repeat across drafts of the same proposal — buyers ask the same compliance question in three sections — and that pattern hits the cache. Cache hit rate in production today sits around 38%. When the cache hits, the lookup is two-digit microseconds and disappears from the budget. When it misses, we pay 18ms-ish on the warm provider connection.

The cache is per-process, not global. We tried a shared Redis cache; the network round-trip wiped the win.
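
A minimal sketch of the pattern, assuming an OpenAI-style client; the model string, cache size, and helper names here are illustrative, not our production code:

from functools import lru_cache

from openai import OpenAI  # assumption: an OpenAI-style embeddings client

client = OpenAI()  # the provider behind this is deploy-time config for us

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivially different phrasings share a key.
    return " ".join(query.lower().split())

@lru_cache(maxsize=10_000)  # per-process, in-memory; Redis lost to the round-trip
def _embed(normalized: str) -> tuple[float, ...]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # a 1536-dim model; illustrative choice
        input=normalized,
    )
    return tuple(resp.data[0].embedding)  # tuple: hashable, safe to cache

def query_embedding(query: str) -> tuple[float, ...]:
    return _embed(normalize(query))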

Stage 2 — Vector search

We index KB blocks in pgvector with HNSW, m=16, ef_construction=200. At query time we tune ef_search per call, dropping it for short queries and raising it for long ones. The default is 80.

SET LOCAL hnsw.ef_search = 80;
SELECT block_id, embedding <=> $1 AS distance
FROM kb_blocks
WHERE company_id = $2 AND active = true
ORDER BY embedding <=> $1
LIMIT 50;

The query planner cooperates because company_id filters the index partition before the ANN scan. We don’t trust query planners by default — we measured it. With company-scoped partitions and a tuned ef_search, P50 sits at 22ms over corpora in the 50K-block range. At 500K we expect this to climb. We have a research branch that pre-filters with a learned classifier; not on the hot path yet.
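
The per-call ef_search tuning lives in application code wrapped around that query. A sketch of the shape, assuming psycopg2 and pgvector’s client adapter; the word-count thresholds are illustrative, the real curve came from measurement:

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector  # assumption: pgvector's adapter

SEARCH_SQL = """
    SELECT block_id, embedding <=> %s AS distance
    FROM kb_blocks
    WHERE company_id = %s AND active = true
    ORDER BY embedding <=> %s
    LIMIT 50
"""

def ef_search_for(query: str) -> int:
    # Illustrative thresholds only.
    words = len(query.split())
    if words <= 4:
        return 40   # short query: cheaper scan, recall holds
    if words >= 20:
        return 120  # long query: spend more to keep recall
    return 80       # the default

def vector_search(conn, query: str, embedding: np.ndarray, company_id: str):
    register_vector(conn)  # adapts the numpy array to a vector literal
    with conn, conn.cursor() as cur:
        # set_config(..., is_local => true) scopes the setting to this
        # transaction, same effect as the SET LOCAL above.
        cur.execute("SELECT set_config('hnsw.ef_search', %s, true)",
                    (str(ef_search_for(query)),))
        cur.execute(SEARCH_SQL, (embedding, company_id, embedding))
        return cur.fetchall()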

Stage 3 — Hybrid merge

Vector search alone misses two things: rare proper nouns (product names, certifications, statute references) and exact-match queries where lexical precision matters more than semantic neighborhood.

We run BM25 in parallel against the same blocks and merge the candidate lists with reciprocal rank fusion. The merge is cheap — milliseconds — but the parallel run isn’t free. We cap the BM25 candidate set at 100 and the vector candidate set at 50, merge, take top 50 forward.
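
The fusion step is a few lines. A sketch; k=60 is the textbook RRF constant and an illustrative choice here:

def rrf_merge(vector_ids: list[str], bm25_ids: list[str],
              k: int = 60, top_n: int = 50) -> list[str]:
    # Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank_d).
    scores: dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, block_id in enumerate(ranked, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)
    # A block that ranks well in both lists outscores one that tops only one.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]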

The win from hybrid is real but not dramatic on most queries. On the tail — queries that mention a specific RFP clause number, a CFR reference, a product SKU — the win is large enough to justify keeping it on the hot path.

Stage 4 — Reranker

This is where the budget gets spent.

A cross-encoder rerank scores each candidate block against the query as a pair, rather than scoring them independently. The score is more accurate. The cost is that 50 candidates means 50 forward passes through the model. We run the reranker on a co-located CPU inference path (ONNX, quantized) so we don’t pay the network hop, but the per-pass cost is still 1ms-ish, and variance under CPU contention is what drives the P95 spike.

We rerank top 50 down to top 10. Below 50 the recall drops; above 50 the latency spikes. We tuned this against held-out RFP queries with eval rubrics from a public corpus.
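
In sketch form; the model name, file path, and ONNX input names depend on the export and are assumptions here:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # assumption: an HF-tokenized cross-encoder

# Illustrative names; ours is a quantized ONNX export served on co-located CPU.
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
session = ort.InferenceSession("reranker.quant.onnx",
                               providers=["CPUExecutionProvider"])

def rerank(query: str, candidates: list[tuple[str, str]],
           top_n: int = 10) -> list[str]:
    # candidates: (block_id, block_text). Each pair is scored jointly with
    # the query, which is where the accuracy and the milliseconds come from.
    enc = tokenizer([query] * len(candidates),
                    [text for _, text in candidates],
                    padding=True, truncation=True, max_length=512,
                    return_tensors="np")
    logits = session.run(None, {
        # Input names vary by export; some also expect token_type_ids.
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0].squeeze(-1)
    keep = np.argsort(-logits)[:top_n]
    return [candidates[i][0] for i in keep]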

Two things we tried and reverted:

  • Skip rerank on high-confidence vector match. Sounds reasonable; broke recall on ambiguous queries. Reverted.
  • GPU rerank. Faster per pass, but cold-start variance and GPU pool contention made P99 worse. Reverted.

The thing we’re still working on: a smaller reranker for the easy 60% of queries with a fallback to the larger one for the hard 40%. The classifier that decides “easy vs. hard” needs to itself be cheap, which is the trick.

Stage 5 — Payload hydration

After the top 10 block IDs come out of rerank, we fetch the actual content. Block text, page references, document metadata, version tags, citation pointers. Five small Postgres queries that we batch into one with WHERE block_id = ANY($1).
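
In sketch form, assuming psycopg2; the column names are illustrative:

import psycopg2.extras

HYDRATE_SQL = """
    SELECT block_id, block_text, page_ref, doc_meta, version_tag
    FROM kb_blocks
    WHERE block_id = ANY(%s)
"""

def hydrate(conn, block_ids: list[str]) -> list[dict]:
    # One round-trip instead of five; psycopg2 adapts the Python list
    # to a Postgres array, which is what makes ANY(%s) work.
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(HYDRATE_SQL, (block_ids,))
        rows = {r["block_id"]: r for r in cur.fetchall()}
    # Re-apply rerank order; the database returns rows in arbitrary order.
    return [rows[b] for b in block_ids if b in rows]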

P50 here is 18ms because the connection pool is warm and the rows are small. P95 climbs when a block has heavy attached metadata — extraction provenance, prior reviewer annotations, linked diagram blocks. We considered moving heavy metadata to a separate fetch behind a “load on demand” flag in the UI, but the drafting engine itself uses some of the metadata (the version tag in particular) so the lazy-load got pushed to the UI layer instead.

Where it breaks

Three places.

Cold cache after deploy. A fresh container with an empty embedding cache pays the full embedding API round-trip on every query for the first 90 seconds. We pre-warm at boot with the top 1,000 questions from the prior week’s logs, which moves most of the cold-cache cost from user-facing queries to startup.
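
The warmer itself is small. A sketch, reusing the normalize and query_embedding helpers from the Stage 1 sketch; the frequency counting is illustrative:

from collections import Counter

def prewarm_embedding_cache(last_week_queries: list[str],
                            limit: int = 1000) -> None:
    # Runs at container boot. Replaying the most frequent normalized questions
    # through the same path the hot path uses fills the per-process LRU
    # before the first user query arrives.
    counts = Counter(normalize(q) for q in last_week_queries)
    for q, _ in counts.most_common(limit):
        query_embedding(q)  # the cache-filling side effect is the whole point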

Bursty corpora. A customer who indexed 80,000 blocks last night runs a query at 9am. The HNSW graph is cold in the page cache. The first query takes 200ms on the vector stage instead of 22ms. Subsequent queries warm. We schedule a query-warmer job after large indexing runs.

Long-tail reranker P99. Once a day, on a query whose candidate list happens to land on a CPU-contended host, the reranker spikes to 600ms. We could pin the reranker to a dedicated pool. We haven’t yet, because the spike is rare and the pool cost isn’t trivial. This is on the list.

The short version

The budget is 120ms P50, 400ms P95, broken across embedding lookup, vector search, hybrid merge, reranker, and payload hydration. The reranker dominates. We cut what we could without hurting recall, accepted the rest, and built monitoring that tells us the day a stage drifts out of budget.

The next post in this thread covers the rewrite stage — the prompt, the model, the refusal path — and where its latency budget lands. Different shape, different tradeoffs.
