Our retrieval eval, quarterly report
A quarter of running our retrieval evaluation harness against a frozen gold set: the regressions we caught, the two changes that actually moved precision, and the metric we stopped reporting because it lied.
We run our retrieval evaluation harness on every backend deploy and a longer version weekly. This post is the quarterly summary: what we measured between July and September, what regressed, what improved, and what we changed in the harness itself.
The harness is described in Eval harness CLI and Testing retrieval against gold sets. The frozen gold set is 412 question-answer pairs across CAIQ, SIG, and three customer RFPs we have permission to use as anonymized test data. Every pair has a labeled relevant block in a snapshot of a synthetic KB.
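For orientation, one gold pair can be pictured roughly like this; the field names below are illustrative, not the harness's actual schema.

```python
# Illustrative shape of one gold-set record; field names are assumptions,
# not the harness's real schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldPair:
    question_id: str        # e.g. a CAIQ/SIG identifier or an internal RFP id
    source: str             # "CAIQ", "SIG", or "RFP"
    question: str           # question text, verbatim or paraphrased
    relevant_block_id: str  # the labeled relevant block in the KB snapshot
```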
The headline numbers
Top-1 retrieval precision moved from 0.71 (June 30) to 0.79 (September 30). Top-5 recall moved from 0.88 to 0.93. Mean reciprocal rank moved from 0.78 to 0.84.
These are gold-set numbers. Customer KBs are messier than the gold set. We do not claim these as production numbers. We claim them as movement on a frozen benchmark, which is what an eval is for.
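As a concrete reading of those three numbers, here is a minimal sketch of how they can be computed, assuming each eval result pairs the gold block id with the ranked list of retrieved block ids.

```python
# A minimal sketch of the three headline metrics. Assumes one labeled gold
# block per question and a ranked list of retrieved block ids per question.
def headline_metrics(results):
    """results: list of (gold_block_id, ranked_block_ids) tuples."""
    n = len(results)
    top1 = sum(1 for gold, ranked in results if ranked and ranked[0] == gold)
    top5 = sum(1 for gold, ranked in results if gold in ranked[:5])
    rr = sum(1.0 / (ranked.index(gold) + 1)
             for gold, ranked in results if gold in ranked)
    return {"top1_precision": top1 / n,
            "top5_recall": top5 / n,
            "mrr": rr / n}
```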
What we changed in the harness itself
Two harness changes shipped this quarter, separately from the underlying retrieval changes.
Per-slice reporting by default. We used to compute slice metrics on demand. We now compute them on every harness run and surface them on the dashboard. The reason: the August reranker regression we caught was visible only on slice numbers. If slice metrics were not on by default, we would have shipped the regression and waited for a customer to surface it.
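A sketch of what per-slice-by-default looks like in practice: the same top-1 computation, grouped by slice on every run. The result shape and the slice-tagging callback are assumptions.

```python
# A sketch of per-slice reporting: group every result by its slices and
# compute the metric per group on every run. The result shape is an assumption.
from collections import defaultdict

def top1(rows):
    # rows: list of (gold_block_id, ranked_block_ids)
    return sum(1 for gold, ranked in rows if ranked and ranked[0] == gold) / len(rows)

def per_slice_top1(results, slices_for):
    """results: (gold_block_id, ranked_block_ids, question) tuples.
    slices_for(question) -> iterable of slice names for that question."""
    by_slice = defaultdict(list)
    for gold, ranked, question in results:
        for name in slices_for(question):
            by_slice[name].append((gold, ranked))
    return {name: top1(rows) for name, rows in by_slice.items()}
```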
Held-out test set discipline. The gold set is 412 pairs. We used to tune retrieval weights against the full set, then report metrics against the same set. That is the textbook way to overfit on your own benchmark. We now hold out 20% of the gold set as a never-tuned test set; weight tuning happens on the remaining 80%. The dashboard reports the held-out numbers, not the tuned numbers.
The held-out discipline cost us about 0.02 on every reported metric (the held-out set is harder than the average of the full set). It is worth the cost. Numbers we cannot trust are worse than lower numbers we can trust.
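One way to keep such a split stable is to hash each pair's id rather than store a list; a sketch, assuming every gold pair has a stable identifier.

```python
# A sketch of a deterministic 80/20 split. Hashing a stable pair id keeps the
# held-out set identical across runs; the id field itself is an assumption.
import hashlib

def is_held_out(pair_id: str, held_out_pct: int = 20) -> bool:
    digest = hashlib.sha256(pair_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < held_out_pct

# Weight tuning runs on pairs where is_held_out(...) is False; the dashboard
# reports metrics only over pairs where it is True.
```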
The two changes that moved the numbers
Hybrid search weighting. We covered the dense + BM25 blend in Hybrid search, dense and sparse. We re-tuned the blend weight on the tuning split of the gold set (the 80% that excludes the held-out test set). The old weight was 0.6 dense, 0.4 sparse. The new weight is 0.55 dense, 0.45 sparse. Small change, real movement: top-1 precision up 4 points on questions with discrete tokens (algorithm names, certification identifiers, retention periods).
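A sketch of the blend itself, assuming per-query min-max normalization before mixing; the normalization choice is our illustration, and the 0.55 / 0.45 weights are the ones quoted above.

```python
# A sketch of the dense + sparse blend. Min-max normalization per query is an
# assumption for illustration; the weights are the ones quoted above.
def blend_scores(dense, sparse, w_dense=0.55):
    """dense, sparse: dicts mapping block_id -> raw score for one query."""
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    d, s = normalize(dense), normalize(sparse)
    return {doc: w_dense * d.get(doc, 0.0) + (1 - w_dense) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}
```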
Question-identifier shortcut for CAIQ and SIG. Questions in CAIQ and SIG have stable identifiers. AIS-01 in CAIQ is the same question across every CAIQ that ships. We added a pre-retrieval lookup: if the incoming question has a CAIQ or SIG identifier and the KB has a block tagged with that identifier, return that block first. Top-1 precision on CAIQ questions moved from 0.74 to 0.91. The change is mechanical, not clever.
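The shortcut is simple enough to sketch; the identifier regex and the KB lookup method below are illustrative, not the production code.

```python
# A sketch of the pre-retrieval identifier shortcut. The regex and the
# kb.lookup_by_identifier method are illustrative, not the production code.
import re

QUESTION_ID = re.compile(r"\b[A-Z]{2,4}-\d{2}\b")  # e.g. "AIS-01"

def retrieve(question, kb, fallback_retriever):
    match = QUESTION_ID.search(question)
    if match:
        block = kb.lookup_by_identifier(match.group(0))  # hypothetical KB method
        if block is not None:
            # Identifier hit goes first; the normal candidates follow.
            return [block] + [b for b in fallback_retriever(question) if b is not block]
    return fallback_retriever(question)
```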
The regression we caught and reverted
In late August we shipped a reranker change that improved overall MRR by 0.02 but tanked precision on numeric-heavy questions. Top-1 on questions containing a specific number (retention period, audit date, certification level) dropped from 0.81 to 0.69. Aggregate metrics looked fine; the slice metric did not.
We caught it because the harness reports per-slice numbers, not just aggregates. The slices we track: question contains a number, question contains a named entity, question contains a compliance framework, question is paraphrased vs. verbatim, question is from CAIQ vs. SIG vs. RFP. The reranker was helping paraphrased questions and hurting verbatim numeric ones.
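For a sense of how those slices could be assigned, here is a sketch of a question tagger; the heuristics (a digit check, a short framework list) are assumptions, and the named-entity slice is omitted because it needs an NER pass.

```python
# A sketch of slice tagging. The heuristics are assumptions; named-entity
# tagging is omitted because it needs an NER pass this sketch does not include.
import re

FRAMEWORKS = ("soc 2", "iso 27001", "gdpr", "hipaa", "pci dss")

def slices_for(question: str, source: str, paraphrased: bool) -> set:
    tags = {f"source:{source}", "paraphrased" if paraphrased else "verbatim"}
    if re.search(r"\d", question):
        tags.add("contains-number")
    if any(fw in question.lower() for fw in FRAMEWORKS):
        tags.add("contains-framework")
    return tags
```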
We rolled back the reranker change within 36 hours. The lesson, which we keep relearning: a single aggregate number hides the regressions that matter.
The metric we stopped reporting
We used to report “answer correctness” — a model-graded score where a stronger model evaluated whether the retrieved block contained the right answer. The Stanford HAI paper on legal RAG hallucinations is part of why we soured on this. Model-graded eval is fast and cheap and noisy in ways that correlate with the model’s own training distribution. Two versions of the same prompt produced different correctness scores against the same retrieval output.
We replaced it with two harder, slower metrics: span-level entailment (does the cited span actually contain the answer?) and human-graded correctness on a 60-question quarterly subsample. The model-graded number is gone from the dashboard. It looked precise. It was not.
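The span check can be pictured, in much-simplified form, as normalized containment; the real metric presumably needs more than substring matching, so treat this as the shape of the check rather than the check itself.

```python
# A much-simplified stand-in for the span-level check: does the gold answer
# appear in the cited span after normalization? A real check would need to
# handle paraphrase; this only sketches the shape of the metric.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def span_contains_answer(cited_span: str, gold_answer: str) -> bool:
    return normalize(gold_answer) in normalize(cited_span)
```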
The slice metrics in detail
Per-slice movement quarter over quarter:
| Slice | Q2 top-1 | Q3 top-1 | Delta |
|---|---|---|---|
| Verbatim CAIQ questions | 0.82 | 0.91 | +0.09 |
| Verbatim SIG Lite questions | 0.78 | 0.85 | +0.07 |
| Paraphrased CAIQ questions | 0.62 | 0.71 | +0.09 |
| Paraphrased SIG Lite questions | 0.58 | 0.68 | +0.10 |
| RFP technical questions | 0.71 | 0.76 | +0.05 |
| RFP management questions | 0.69 | 0.74 | +0.05 |
| Numeric-token questions | 0.74 | 0.81 | +0.07 |
| Compound questions (multi-fact) | 0.55 | 0.61 | +0.06 |
The slice that moved least: compound questions. Those are questions whose answers combine facts from two or more KB blocks. Our retrieval is good at finding the closest single block; it is weaker at recognizing that an answer needs two blocks composed. We are working on it. The technique we are testing is a second retrieval pass conditioned on the first hit's content, but we do not have shippable numbers yet.
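The two-pass idea we are testing, sketched under the assumption of a retriever that takes a query string and returns scored blocks; this is the idea as described above, not shipped code.

```python
# A sketch of the two-pass idea for compound questions: retrieve once, then
# re-query conditioned on the best first-pass block, and merge the candidates.
# The retriever signature and block fields are assumptions, not the shipped
# implementation.
def two_pass_retrieve(question, retriever, top_k=5):
    first = retriever(question, top_k=top_k)
    if not first:
        return first
    conditioned = f"{question}\n\nContext: {first[0].text}"
    second = retriever(conditioned, top_k=top_k)
    seen, merged = set(), []
    for block in first + second:
        if block.block_id not in seen:
            seen.add(block.block_id)
            merged.append(block)
    return merged[:top_k]
```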
What ships next quarter
Three things are queued for Q4. Latency-aware retrieval that gives up on a third candidate if the first two are highly ranked. A second gold set built from a partner’s anonymized DDQ corpus — bigger, harder, more representative. And a per-tenant eval mode so customers can run the harness against their own KB and see their own numbers, not ours.
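The latency-aware item reduces to an early exit; a sketch of the idea, with the confidence threshold and retriever interface as assumptions.

```python
# A sketch of the latency-aware idea queued for Q4: skip fetching a third
# candidate when the first two already clear a confidence threshold. The
# threshold value and retriever interface are assumptions; nothing is shipped yet.
def retrieve_with_early_exit(question, retriever, threshold=0.85):
    top_two = retriever(question, top_k=2)
    if len(top_two) == 2 and all(c.score >= threshold for c in top_two):
        return top_two                      # confident enough; skip the third fetch
    return retriever(question, top_k=3)
```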
The dashboard ships publicly this week. See Shipped: the retrieval-eval dashboard for the changelog and a link to the live numbers.