Embedding evaluation, revisited
What we measure differently from 12 months ago. How the gold set grew, which metrics earned their spot in CI, and which ones we quietly retired.
The first months of embedding evaluation were “run the gold set, read the numbers, argue about the deltas.” This quarter we’re approaching it with a more disciplined idea of what we’re measuring and why.
This is a short post about the pieces that changed. The full retrieval retrospective lands in the one-year grounded retrieval pillar later this week; this one is just the embedding eval slice.
What the gold set looks like now
A year ago we had 80 hand-authored questions across three corpora. We wrote the curation recipe in how we curate the golden set. The questions were useful for directional signal: did a change help or hurt?
The current set is 1,240 questions across 11 corpora. The growth came in three phases:
- Corpus breadth. Every new customer workload gets 40 new questions in the first two weeks. That’s the biggest contributor to set size.
- Regression capture. Every production incident that surfaces a retrieval bug becomes a gold question before the fix ships. This is the second-biggest contributor and arguably the highest-value one per question.
- Structured annotation. We added question-type tags to every question: factual lookup, compound reasoning, discriminator claim, numeric, temporal. Tags let us report per-type metrics instead of one pooled number.
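For concreteness, here is roughly the shape a tagged gold question takes. This is an illustrative sketch; the field names and tag strings are assumptions, not our actual schema:

```python
# Illustrative record shape for a gold question; names are hypothetical.
from dataclasses import dataclass

@dataclass
class GoldQuestion:
    question_id: str
    corpus: str                        # e.g. "federal-acquisition", "healthcare-compliance"
    question: str
    must_cite_blocks: list[str]        # block IDs the retriever must surface
    reference_answer: str
    question_type: str                 # "factual_lookup" | "compound_reasoning" | "discriminator"
                                       # | "numeric" | "temporal"
    source_incident: str | None = None # set when the question came from regression capture
```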
The corpus count is where the surprise was. A year ago we thought three corpora gave us enough coverage. They didn't. Federal acquisition language has its own grammar. Healthcare compliance has its own cross-reference patterns. Adding a corpus surfaces failure modes that don't exist in the others, and those failure modes stay invisible until that corpus exists.
Metrics that earned their place
Four metrics run on every eval and every CI run:
- Recall@10. Does the retriever surface the must-cite blocks for a question?
- Answer faithfulness. Is every sentence in the generated answer entailed by a cited source?
- Answer completeness. Does the answer cover every fact the reference answer covers?
- Abstention precision. When the system refuses, is that the right call?
All four are version-one RAG metrics. None of them are novel. RAGAS implements versions of the same four. We use our own implementations because our entailment check is a moving part we want to control, but anyone starting fresh should run RAGAS as a baseline.
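For readers who want the mechanics, here is a minimal sketch of the two metrics that don't require a model call. The field names (`retrieved_ids`, `must_cite_blocks`, `abstained`, `should_abstain`) are assumptions for illustration, and the sketch assumes the gold set labels which questions should be refused; faithfulness and completeness are omitted because they hinge on the entailment check:

```python
# Sketch of recall@10 and abstention precision; not our production implementation.

def recall_at_10(retrieved_ids: list[str], must_cite_blocks: list[str]) -> float:
    """Fraction of must-cite blocks that appear in the top 10 retrieved blocks."""
    if not must_cite_blocks:
        return 1.0
    top10 = set(retrieved_ids[:10])
    hits = sum(1 for block in must_cite_blocks if block in top10)
    return hits / len(must_cite_blocks)

def abstention_precision(results: list[dict]) -> float:
    """Of the questions where the system refused, how many were correct refusals?
    Each result is assumed to carry `abstained` and `should_abstain` booleans."""
    refusals = [r for r in results if r["abstained"]]
    if not refusals:
        return 1.0
    correct = sum(1 for r in refusals if r["should_abstain"])
    return correct / len(refusals)
```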
What earned its place this year was the per-question-type breakdown. A single “recall@10 is 0.87” number hides that compound-reasoning recall is 0.71 and factual-lookup recall is 0.94. The aggregate number moves slowly; the per-type numbers move when we change things that touch that type. We now report all four metrics at both the pooled and the per-type level.
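The per-type report is just a group-by over the question-type tags. A sketch, assuming each result row carries its tag and its recall score:

```python
# Hypothetical per-type breakdown: pool recall@10 by question_type tag
# instead of reporting one aggregate number.
from collections import defaultdict
from statistics import mean

def per_type_recall(results: list[dict]) -> dict[str, float]:
    """results: one dict per gold question with `question_type` and `recall_at_10`."""
    by_type: dict[str, list[float]] = defaultdict(list)
    for r in results:
        by_type[r["question_type"]].append(r["recall_at_10"])
    return {qtype: round(mean(scores), 3) for qtype, scores in by_type.items()}

# e.g. {"factual_lookup": 0.94, "compound_reasoning": 0.71, ...} alongside the pooled 0.87
```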
Metrics we retired
Cosine similarity between generated answer and reference answer. This was in the early dashboard. It correlates with faithfulness on questions that have one reasonable phrasing, and it correlates with nothing useful on questions that admit multiple correct answers. We noticed it moving in the wrong direction on changes that were clearly improvements and retired it.
MRR on the reranker output. A reasonable metric in principle. In practice, MRR and recall@10 moved together on every change we cared about, and MRR added noise on ties. We kept recall@10 and dropped MRR from the CI report. The reranker code still emits MRR for debugging; nobody reads it.
Mean cost per eval question. We track median and P95 cost separately. The mean is dominated by a small number of outlier questions that need a second retrieval pass, so it moves when the outlier count changes even when the typical question's cost hasn't. Median is the better summary; P95 is the one we defend on.
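A toy illustration of why, with made-up costs rather than real data:

```python
# Made-up numbers: a handful of second-pass outliers drag the mean
# while the median and P95 stay put.
import numpy as np

costs = np.array([0.002] * 188 + [0.030] * 12)   # 188 typical questions, 12 outliers
print(np.mean(costs))               # ~0.0037, dragged up by the outliers
print(np.median(costs))             # 0.002, the typical question
print(np.percentile(costs, 95))     # 0.030, the tail we defend on

# Double the outlier count and only the mean moves meaningfully.
costs2 = np.array([0.002] * 176 + [0.030] * 24)
print(np.mean(costs2))              # ~0.0054
print(np.median(costs2))            # still 0.002
```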
What runs where
Every merge to main runs a 200-question stratified subset in CI. That job finishes in about 8 minutes. It reports deltas on all four metrics against the last run on main; if any delta crosses its threshold (the tightest is answer faithfulness, which fails CI on a 3-point drop), the merge blocks until a human reviews it.
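The gate itself is small. This is a sketch rather than our actual harness, and every threshold except the 3-point faithfulness drop is a placeholder:

```python
# Sketch of the CI gate. Only the faithfulness threshold comes from the post;
# the rest are placeholders you would tune for your own eval.
import sys

THRESHOLDS = {                      # maximum allowed drop, in points
    "answer_faithfulness": 3.0,     # the tightest gate
    "recall_at_10": 5.0,            # placeholder
    "answer_completeness": 5.0,     # placeholder
    "abstention_precision": 5.0,    # placeholder
}

def gate(previous: dict[str, float], candidate: dict[str, float]) -> int:
    """Compare the candidate run against the last run on main; nonzero blocks the merge."""
    failures = []
    for metric, max_drop in THRESHOLDS.items():
        drop = previous[metric] - candidate[metric]
        if drop > max_drop:
            failures.append(f"{metric} dropped {drop:.1f} points (limit {max_drop})")
    for failure in failures:
        print(f"BLOCKING: {failure}")
    return 1 if failures else 0

if __name__ == "__main__":
    # In CI, both runs would be loaded from the eval harness output.
    sys.exit(gate(
        {"answer_faithfulness": 91.0, "recall_at_10": 87.0,
         "answer_completeness": 84.0, "abstention_precision": 90.0},
        {"answer_faithfulness": 87.5, "recall_at_10": 88.0,
         "answer_completeness": 84.0, "abstention_precision": 90.0},
    ))
```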
The full 1,240-question set runs nightly on main. The nightly job produces a report that gets pasted into the engineering channel. Slow metrics we care about but don’t block on — median cost per answer, P95 draft latency, cache hit rate on entailment — land in the nightly report.
We also run the full set before every embedding or reranker change, against both the current production config and the candidate. The diff is the shipping criterion. We wrote about the protocol in retrieval evaluation part 2.
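The diff itself is the same per-type table computed for both configs and subtracted. A sketch, reusing the shape of the per-type report above:

```python
# Hypothetical shape of the production-vs-candidate diff reviewed before shipping.

def config_diff(production: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-type delta, candidate minus production. Positive means the candidate is better."""
    return {qtype: round(candidate[qtype] - production[qtype], 3) for qtype in production}

# e.g. {"factual_lookup": 0.01, "compound_reasoning": 0.04, "numeric": -0.02}
```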
Open question
The pooled eval set treats all corpora as equally important. That’s fine as a first-pass measurement and wrong as a customer-facing one. A civic-procurement vendor doesn’t care about our healthcare-corpus recall, and vice versa. We’re working on a per-tenant eval slice that lets a customer annotate 40 questions against their own KB and get metrics back specific to their workload. Scoped, not yet shipped. More on this in the pillar post on Thursday.