How we evaluate retrieval quality on our own corpus
Our gold set, the metrics we track, the eval harness on a laptop, the regression-guard CI job, and the directional numbers we'll publicly stand behind. Long read.
We get the same question from prospects every week: how do you know your retrieval is good? The honest answer is that almost no proposal-software vendor can answer it with a number, because almost no vendor measures retrieval against a labeled set on a corpus that looks like a customer’s. We do measure it. This post is how.
It’s also a long post. The mechanics of retrieval evaluation are the load-bearing thing under everything we say about grounded AI. If we don’t measure, we don’t know. If we don’t know, the Grounded-AI Pledge is a slogan instead of a contract. So this is the whole picture: gold set construction, metrics, the harness, the directional numbers, and the regression-guard. Skip to the section that’s relevant to your job.
A note on the numbers. The metric values in this post are directional. They are produced by our internal eval harness on our own gold sets across our own customer corpora. They are not a general benchmark and we would not expect them to reproduce on an external setup. What is durable here is the methodology and the shape of the results; the specific numbers are a snapshot that moves with every gold-set extension and every pipeline change. We will publish the harness and the public slice of our gold set when both are stable.
Why retrieval eval is hard for proposals
The standard retrieval benchmarks were not built for our problem.
MTEB ranks embedding models on dozens of tasks — clustering, classification, semantic textual similarity, retrieval over Wikipedia and CQADupStack and Touche-2020. BEIR collected 18 retrieval datasets ranging from BioASQ to FiQA. MS MARCO is search queries from Bing. All three are useful and none of them measure what a proposal team needs.
A proposal team’s retrieval is not “find the document about cardiology in PubMed” or “find the StackOverflow answer to a Python question.” It is: a buyer asked about our SOC 2 Type II posture, our subprocessor list, our incident-response SLA, and our integration with their ERP — give me the specific block in our knowledge base that answers each one, and tell me when no block does. The query distribution is narrow, the corpus is private, and the cost of a wrong block is a fabricated commitment in a contract.
We have looked at the available evaluation tooling closely. RAGAS — which we genuinely like — provides automated metrics including faithfulness, answer relevance, and context precision/recall, computed by a judge LLM. ARES trains lightweight judges on synthetic queries and evaluates context relevance, answer faithfulness, and answer relevance. TruLens gives a feedback-functions framework for LLM-app evals. These are good tools and we use parts of them, but none ship a labeled gold set for proposal corpora because no such public corpus exists. Customers don’t open-source their KBs, and for good reason.
So the eval problem reduces to: build a labeled gold set on customer data with permission, measure with classical IR metrics plus the ones we care about, and re-measure on every change.
That’s what we did, and what the rest of this post describes.
Our gold set construction
A gold set is a list of (query, expected-block-ids) pairs. The query is something a real user would type into the search bar or paste from an RFP. The expected blocks are the ones a human labeler said are correct answers to that query, with a graded relevance score (usually 0 / 1 / 2 — irrelevant / relevant / highly relevant).
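For concreteness, here is roughly the shape of a single gold-set record as a type. This is an illustrative sketch; the field names are ours for this post, not the production schema.

// Illustrative shape of one gold-set record. Field names are hypothetical.
type Relevance = 0 | 1 | 2; // irrelevant / relevant / highly relevant

interface GoldRecord {
  queryId: string;
  corpusId: string;                                // which of the five corpora the query belongs to
  query: string;                                   // what a user typed, or a question extracted from an RFP
  labels: { blockId: string; grade: Relevance }[]; // the human-labeled expected blocks
  answerable: boolean;                             // false means the correct behavior is to refuse
}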
Five customer corpora live behind our gold set. All five gave us explicit, written permission to use anonymized versions of their KB for internal evaluation. That permission matters legally, and it also matters editorially: we will not put anonymized customer text into a public benchmark, and we will not name the customers in this post. What we can say is that the corpora span vertical-software, healthcare-IT, govcon services, fintech, and developer tooling. They average around 8,000 indexed blocks each. They each contain the kinds of artifacts a real proposal KB contains — security questionnaires, prior winning RFP responses, product one-pagers, compliance attestations, MSAs, master subscription agreements, past performance write-ups.
The query side is built three ways:
- Replayed real queries. With customer permission, we sampled queries the product had run on their corpus over the prior six months. Anonymized, deduplicated, and stripped of any value that looked like a contract identifier. These are the most valuable queries — they are exactly what users type.
- RFP-extracted queries. We took ~40 public RFPs (state-procurement and federal, nothing private) and ran them through our extraction pipeline, which produces individual question records. Each question is a candidate query against an indexed corpus. The customers’ SMEs labeled which blocks in their KB would answer that question if their company were to bid on the RFP.
- Adversarial queries. A small set (~150) of queries written specifically to test failure modes — multi-clause questions, numeric-identifier lookups, queries that the human labeler knew had no good answer in the corpus and should produce an “ungrounded” verdict. The point of adversarial queries is to make sure the system can refuse, not just retrieve.
The labeling process: two labelers per query, both proposal practitioners (one from PursuitAgent, one the customer’s SME). Each labeler is shown the query and the top 50 retrieved candidates from a baseline retriever (so the labeler is not asked to read the whole corpus). They mark each candidate 0/1/2. Disagreements are resolved by a third reviewer. We measured Cohen’s kappa across labelers on a 200-pair subsample; agreement landed in the 0.7–0.8 range, which is the IR-eval norm for graded relevance — not great, but workable, and consistent with what TREC tracks publish on their own labeled sets.
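For reference, the agreement number comes from the standard Cohen's kappa formula over the two labelers' grades. A minimal sketch (not our labeling tooling):

// Cohen's kappa for two labelers over the same items, grades in {0, 1, 2}.
function cohensKappa(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error("label arrays must align");
  const n = a.length;

  // Observed agreement: fraction of items where both labelers gave the same grade.
  const pObserved = a.filter((grade, i) => grade === b[i]).length / n;

  // Expected agreement under independence, from each labeler's marginal distribution.
  const categories = [...new Set([...a, ...b])];
  const pExpected = categories.reduce((sum, c) => {
    const pa = a.filter((g) => g === c).length / n;
    const pb = b.filter((g) => g === c).length / n;
    return sum + pa * pb;
  }, 0);

  return (pObserved - pExpected) / (1 - pExpected);
}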
The current set is approximately 3,000 query-block pairs across the five corpora, distributed roughly: 60% replayed real queries, 30% RFP-extracted, 10% adversarial. The set grows by about 200 pairs a month as we onboard new corpora and as the mix of “ungrounded” queries — the ones we want the system to refuse — gets richer.
A note on a thing we got wrong early: we initially built the gold set on a single customer’s corpus. The metrics looked great. They were great — for that customer’s writing style, vocabulary, and document topology. The first time we onboarded a new customer with a different vertical, our precision dropped by what looked like 15 points. The lesson: a gold set on one corpus is a benchmark of that corpus, not of retrieval quality. Five corpora is the floor that gave us results that generalized in onboarding.
A second thing we got wrong: we initially treated all labeled blocks as equally relevant. A block was either “the answer” or “not the answer.” That binary masked the difference between “the canonical block answers this question completely” and “this block has a sentence that contributes to the answer but isn’t the canonical block.” When we moved to graded relevance — 0/1/2, with 2 reserved for canonical — the metrics became more honest, the per-query diffs became more useful, and the labelers found the work less ambiguous. Graded relevance also enables nDCG, which is a more discriminating ranking metric than binary precision@k. We should have started with graded labels. We didn’t, because the binary version is faster to build, and the cost of the rebuild was a couple of weeks of relabeling that we could have avoided.
Permission and anonymization deserve their own paragraph. Every customer in the gold set signed a written agreement covering specifically: the corpus may be used for retrieval evaluation; the corpus and the labels stay on PursuitAgent infrastructure; nothing leaves except aggregate metrics; the customer can pull permission at any time and we delete the indexed copy. Anonymization is layered — named entities are replaced with consistent pseudonyms (so “Acme Corp” becomes “Customer-A” everywhere it appears), product names with category placeholders, and any identifier-shaped data (account numbers, contract IDs) with synthetic values that preserve length and shape but no real information. The pseudonym mapping lives in a separate, access-controlled store; the indexed corpus and the gold-set labels never touch the un-anonymized text. This matters legally and it also matters for the eval itself: we want the metrics to reflect retrieval quality on text that looks like real customer text, including the parts that vary across corpora, without exposing any customer’s content to anyone else’s evaluation.
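To make the consistent-pseudonym idea concrete, here is a sketch of the two replacement behaviors described above. The class and function names are illustrative, not our anonymization pipeline.

// Consistent pseudonyms: the same entity maps to the same placeholder every time it appears.
class PseudonymMap {
  private mapping = new Map<string, string>();
  private nextIndex = 0;

  constructor(private readonly prefix: string) {} // e.g. "Customer"

  replace(entity: string): string {
    let pseudonym = this.mapping.get(entity);
    if (!pseudonym) {
      // sketch only: A, B, C, ...; a real mapping handles more than 26 entities
      pseudonym = `${this.prefix}-${String.fromCharCode(65 + this.nextIndex++)}`;
      this.mapping.set(entity, pseudonym);
    }
    return pseudonym;
  }
}

// Identifier-shaped values: synthetic replacements that preserve length and shape only.
function syntheticIdentifier(original: string): string {
  const digit = () => String(Math.floor(Math.random() * 10));
  const upper = () => String.fromCharCode(65 + Math.floor(Math.random() * 26));
  const lower = () => String.fromCharCode(97 + Math.floor(Math.random() * 26));
  return original.replace(/[0-9]/g, digit).replace(/[A-Z]/g, upper).replace(/[a-z]/g, lower);
}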
Metrics we track
The metrics fall into three buckets.
Classical IR metrics. Precision@k for k in {1, 3, 5, 10}. Recall@20. Mean Reciprocal Rank (MRR). Normalized Discounted Cumulative Gain (nDCG). These have been the IR field’s baseline for two decades. Precision@k tells us “of the top k blocks the retriever returned, what fraction were labeled relevant.” Recall@20 tells us “of the blocks that should have been returned, how many made it into the top 20.” MRR tells us how quickly the first relevant result appears. nDCG accounts for graded relevance — a block scored 2 (highly relevant) at rank 1 is worth more than a block scored 1 at rank 1.
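For readers who want the definitions rather than the names, here is roughly how the per-query versions compute. This is a minimal sketch against graded labels, not the harness code.

// "ranked" is the ordered list of retrieved block IDs; "grades" maps blockId -> 0/1/2.
// A block counts as relevant when its grade is greater than 0.
type Grades = Map<string, number>;

function precisionAtK(ranked: string[], grades: Grades, k: number): number {
  return ranked.slice(0, k).filter((id) => (grades.get(id) ?? 0) > 0).length / k;
}

function recallAtK(ranked: string[], grades: Grades, k: number): number {
  const relevant = [...grades.values()].filter((g) => g > 0).length;
  if (relevant === 0) return 0; // ungrounded queries are scored separately
  return ranked.slice(0, k).filter((id) => (grades.get(id) ?? 0) > 0).length / relevant;
}

function reciprocalRank(ranked: string[], grades: Grades): number {
  const first = ranked.findIndex((id) => (grades.get(id) ?? 0) > 0);
  return first === -1 ? 0 : 1 / (first + 1);
}

function ndcgAtK(ranked: string[], grades: Grades, k: number): number {
  const gain = (g: number) => 2 ** g - 1; // graded: a 2 at a given rank is worth more than a 1
  const dcg = (labels: number[]) =>
    labels.slice(0, k).reduce((sum, g, i) => sum + gain(g) / Math.log2(i + 2), 0);
  const ideal = dcg([...grades.values()].sort((a, b) => b - a)); // best possible ordering
  return ideal === 0 ? 0 : dcg(ranked.map((id) => grades.get(id) ?? 0)) / ideal;
}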
We track precision@5 as the headline, because the drafting engine consumes the top 5 retrieved blocks. Precision@5 is the metric that most directly maps to “did the drafter see the right material.” Recall@20 is the secondary headline — it’s a ceiling check. If recall@20 is bad, no amount of reranking will save us.
Ungrounded-detection metrics. A retriever isn’t done when it returns blocks; it has to know when it shouldn’t return any. We split queries into “answerable” (the gold set has at least one labeled-relevant block) and “ungrounded” (the gold set says no block in the corpus answers this query). We measure the system’s true-positive rate on ungrounded queries — i.e., how often it correctly refuses — and its false-refusal rate on answerable queries. These two numbers move in opposite directions when you tune the retrieval-floor threshold; we track both and look at the ROC.
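A sketch of how the two refusal numbers fall out of a single candidate threshold. The score here stands in for whatever the retrieval-floor gate actually looks at; the exact signal is an implementation detail.

interface EvalQuery {
  answerable: boolean; // does the gold set contain at least one relevant block?
  topScore: number;    // the best retrieval score the pipeline produced for this query
}

function refusalRates(queries: EvalQuery[], threshold: number) {
  const refused = (q: EvalQuery) => q.topScore < threshold;
  const ungrounded = queries.filter((q) => !q.answerable);
  const answerable = queries.filter((q) => q.answerable);
  return {
    ungroundedTpRate: ungrounded.filter(refused).length / ungrounded.length, // correct refusals
    answerableFrRate: answerable.filter(refused).length / answerable.length, // wrong refusals
  };
}

// Sweeping the threshold across a grid of values traces out the ROC-style trade-off.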
Claim-coverage rate. This is our own metric, and the one we care about most. After the drafting engine produces an answer, we ask: of the substantive claims in the drafted sentence, what fraction have a span in a gold-labeled block that supports them? It’s similar to RAGAS’s faithfulness metric but computed against gold labels rather than against the retrieved context, which means it’s harder to game. A draft that paraphrases an off-topic block will score high on faithfulness-against-context (the block was retrieved, the model rephrased it) but low on claim-coverage (the block isn’t in the gold set for that query).
What we don’t use: BLEU, ROUGE, METEOR. These are word-overlap metrics designed for translation and summarization. Proposal text is not summarization. A well-written answer can be a complete rewrite of the source block, and a poorly-written answer can be a near-verbatim copy of the wrong block. Word overlap is uncorrelated with the thing we care about. The Stanford HAI legal-RAG paper makes this point sharply: “citation present but unsupported” is a different failure than “wrong words.” We measure the supported part.
The eval harness, on a laptop
The harness is one command:
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06
Behind that command:
- Load the gold set for the named corpus and date.
- Load the corpus’s index.
- For each query, run the retrieval pipeline (embedding, ANN search, reranker, threshold gate).
- Compute precision@k, recall@20, MRR, nDCG, ungrounded-detection metrics, claim-coverage.
- Write a JSON report to eval-runs/<timestamp>.json and a Markdown summary to stdout.
The output looks roughly like this:
=== eval run 2025-06-28T14:21:00Z ===
corpus: corpus-3 (n_blocks=8,142)
gold set: 2025-06 (n_queries=621, n_labeled_pairs=2,847)
embedder: text-embedding-3-large
reranker: bge-reranker-v2
precision@1 : 0.81
precision@3 : 0.76
precision@5 : 0.72
precision@10 : 0.61
recall@20 : 0.88
MRR : 0.79
nDCG@10 : 0.74
ungrounded TP rate : 0.83
answerable FR rate : 0.07
claim-coverage : 0.86
A developer working on retrieval runs this on their laptop before opening a PR. The whole run takes around three minutes per corpus on the dev machine — the embedding cost is paid at index-build time, not at eval time, because the gold-set queries are pre-embedded and cached. Reranking is the slow step. We ship the harness with a --no-rerank flag for the early development loop where you just want to know if the embeddings moved.
The dev workflow looks like this:
# branch from main
git checkout -b retrieval-tuning
# baseline run on the corpus you're tuning against
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=baseline
# make a change — new chunking, new reranker, new threshold
$EDITOR backend/src/kb/retrieval/rerank.ts
# rerun
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=after-rerank-change
# diff
pnpm eval:diff baseline after-rerank-change
eval:diff prints a side-by-side of the two runs and flags any metric that moved by more than 0.5 points. It also flags individual queries whose top-1 retrieved block changed — the per-query diff is where most of the actual learning happens, because aggregate metrics hide important regressions.
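A sketch of what the diff checks. The report shape is illustrative, not the actual JSON schema.

interface EvalRun {
  metrics: Record<string, number>; // e.g. { "precision@5": 0.72, "recall@20": 0.88, ... }
  topHits: Record<string, string>; // queryId -> top-1 retrieved blockId
}

// 0.5 "points" on a 0-1 metric is 0.005 absolute.
function diffRuns(before: EvalRun, after: EvalRun, tolerance = 0.005) {
  const movedMetrics = Object.keys(before.metrics).filter(
    (name) => Math.abs((after.metrics[name] ?? 0) - before.metrics[name]) > tolerance,
  );
  const changedTop1 = Object.keys(before.topHits).filter(
    (queryId) => before.topHits[queryId] !== after.topHits[queryId],
  );
  return { movedMetrics, changedTop1 }; // the per-query list is where the learning happens
}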
We will write a follow-up specifically on the harness’s CLI ergonomics (Day 68 — The eval harness, on the command line). This post is about the methodology behind it.
What the numbers actually look like
We are publishing directional numbers, not precise claims. The numbers below describe the current state on our most-recent gold set as of late June 2025, averaged across the five corpora. A specific customer’s corpus will land above or below these.
- precision@5: ~0.7X. Our headline metric runs in the low-to-mid 0.7s most weeks. It has crept up over the last two quarters as we added the cross-encoder reranker on top of the dense first stage. It is not 0.9. We don’t claim it is.
- recall@20: ~0.85–0.90. The ceiling check is generally healthier than precision because we cast a wide net before reranking. When recall@20 dips below 0.85 in a release, that is the alarm bell that something upstream of the reranker has changed.
- MRR: ~0.75–0.80. The first relevant result usually appears at rank 1 or 2.
- claim-coverage: ~0.85. This number has been the most stable across changes — because the drafting engine refuses when entailment fails, the claims that do ship tend to be the well-grounded ones.
- ungrounded true-positive rate: ~0.80–0.85. The system catches most of the queries it should refuse. It still misses some. False refusals on answerable queries run around 0.05–0.08.
Where we are weak, honestly:
- Numeric-identifier queries. “Show me the answer to question 4.7.1.b in our last response.” Embeddings handle these poorly because the semantically-relevant signal is the identifier, not the surrounding language. We have a hybrid path that falls back to BM25 on identifier-shaped queries; it is in production and adequate, not great. There is a sketch of the routing idea after this list.
- Multi-clause questions. “Describe your incident-response SLA, the response time, and how it differs for paying vs. trial customers.” A single block rarely answers all three clauses. Our current path retrieves blocks that answer the dominant clause and we miss on the others. We are working on multi-block claim composition; it’s not in the production drafting path yet.
- Cross-document synonym questions. When the customer’s KB calls something “Service Provider Agreement” and the buyer’s RFP calls it “Vendor Master Agreement,” we are dependent on the embedding model handling that synonym. The dense retriever generally does. The reranker sometimes overweights surface form and demotes the correct block. This tuning is ongoing.
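The routing sketch mentioned in the numeric-identifier bullet, in rough form. The regex and function names are illustrative; the real detector and the BM25 path are more involved than this.

interface Block { id: string; text: string }

// Stand-ins for the two retrieval paths; not real signatures from our codebase.
declare function bm25Search(query: string): Promise<Block[]>;
declare function denseSearchWithRerank(query: string): Promise<Block[]>;

// "4.7.1.b"-style section identifiers, plus simple prefixed IDs like "RFP-2023".
const IDENTIFIER = /\b\d+(?:\.\d+)+(?:\.[a-z])?\b|\b[A-Z]{1,4}-\d{2,}\b/;

async function retrieve(query: string): Promise<Block[]> {
  if (IDENTIFIER.test(query)) {
    return bm25Search(query);          // lexical path: the identifier is the signal
  }
  return denseSearchWithRerank(query); // default path: dense first stage + cross-encoder
}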
We do not publish a single all-time precision number with three decimal places, because it would be misleading. Retrieval quality varies by corpus, by query type, by week. What we will commit to publicly is: every quarter, we publish a state-of-retrieval post with directional numbers and the regressions we caught. The first one lands in early Q4 2025.
How we regression-guard
The eval harness is a developer tool. The regression-guard is a CI tool.
Every PR that touches backend/src/kb/retrieval/** or backend/src/kb/embed/** triggers a CI job that runs the eval suite on a held-out gold-set slice — a pinned 600-query subset that’s been locked since early 2025. Pinning matters: if the gold set changes underneath the regression test, you can’t tell whether a metric moved because the code changed or because the labels changed.
The CI job has thresholds:
- precision@5 cannot drop more than 0.5 absolute points vs. main.
- recall@20 cannot drop more than 0.5 points.
- claim-coverage cannot drop more than 1.0 points.
- ungrounded TP rate cannot drop more than 1.5 points.
If any threshold is breached, the build fails. The PR cannot merge until either the regression is fixed or the threshold is raised with a written justification in the PR description, signed off by a second engineer. The override path does get used, and it is the right answer when, say, the chunking strategy changed and the gold set’s old blocks no longer exist. In that case the gold set needs to be rebuilt against the new index, the run repeated, and the threshold reset. We have done this twice in the last six months. Both times the explicit override was the correct call, and the discipline of forcing a written justification kept us honest.
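A sketch of the gate itself, with the thresholds above expressed on the 0-1 scale (0.5 points = 0.005). The metric keys and the comparison against main are illustrative, not the CI job's actual code.

const MAX_DROP: Record<string, number> = {
  "precision@5": 0.005,
  "recall@20": 0.005,
  "claim-coverage": 0.01,
  "ungrounded-tp-rate": 0.015,
};

// Returns human-readable failures; an empty list means the gate passes.
function regressionGate(main: Record<string, number>, pr: Record<string, number>): string[] {
  return Object.entries(MAX_DROP)
    .filter(([metric, maxDrop]) => main[metric] - pr[metric] > maxDrop)
    .map(([metric, maxDrop]) =>
      `${metric} dropped ${(main[metric] - pr[metric]).toFixed(3)} vs. main (allowed ${maxDrop})`);
}

// In CI: fail the build (non-zero exit) when regressionGate(...) returns anything.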
The “I changed the chunking, scores moved” case deserves a callout because it’s the one most teams handle badly. When the chunking strategy changes, the index changes, which means the labeled blocks in the gold set may not exist anymore. The naive fix is to re-label the gold set against the new index, which produces a “regression” of zero by construction — useless. The right fix is to keep the queries, throw out the block-level labels for any block that changed, and re-label only the affected pairs. We have a script that does the diff and queues only the re-label work that’s actually needed; it cuts a full re-label cycle from about a week to about half a day.
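The core of that diff script is small. A sketch with illustrative names: keep every query, drop only the labels whose block no longer exists in the new index, and queue exactly those pairs for re-labeling.

interface Label { queryId: string; blockId: string; grade: 0 | 1 | 2 }

// Labels pointing at blocks that survived the re-chunk stay untouched; the rest go
// to the re-label queue. The regression baseline keeps its meaning because most of
// the gold set is unchanged.
function labelsNeedingRelabel(labels: Label[], newIndexBlockIds: Set<string>): Label[] {
  return labels.filter((label) => !newIndexBlockIds.has(label.blockId));
}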
When the regression CI fails, the diff review is where the engineering happens. Two questions get asked, in order:
- Was the regression real or was it noise? Run the same eval three times and look at the variance. We know the harness is roughly deterministic per-run (same embeddings, same reranker, same query order), so within-run variance is near-zero. Across-run variance is from upstream model nondeterminism (LLM judges in claim-coverage, mostly). If the metric moves outside the noise band, it’s real.
- If real, which queries regressed? The per-query diff is the actual answer. Usually 5–20 specific queries account for most of the metric movement. Looking at those queries — and the blocks the system used to retrieve vs. the blocks it now retrieves — points at the cause faster than staring at aggregate numbers.
We over-rotated early on aggregate metrics and under-rotated on per-query inspection. The lesson cost us about three weeks of chasing a phantom “precision regression” that turned out to be twelve queries whose gold labels were stale. Per-query first, aggregates second.
A separate lesson on threshold-setting: we picked the initial regression thresholds (0.5 absolute points on precision@5, 1.0 on claim-coverage, etc.) by looking at week-over-week variance on main during a quiet engineering period. The numbers moved by less than half a point in a week with no retrieval-side commits, which gave us the noise floor. The threshold sits comfortably above that noise floor without being so loose it lets meaningful regressions through. We re-validate the noise floor every quarter; if the underlying models are upgraded — say a new embedding-model release — the noise floor can shift, and the thresholds need a corresponding re-baseline. We did this in February when we moved embedders, and we will do it again the next time. Thresholds are not constants; they are calibrated against the variance the system actually exhibits.
One more thing about CI failures that surprised us: the most common cause of a regression-guard failure is not a bad PR. It is a flaky external dependency — the LLM judge that computes claim-coverage occasionally times out and falls through to a degraded mode that scores worse than its mainline path. We added retries and a strict timeout policy after the third PR got bounced for what turned out to be an outage at the judge model’s provider. The harness now re-runs claim-coverage up to three times if the judge returns a refusal or a timeout, and falls through to a hard-fail (rather than a degraded score) if all three fail. This made the eval more reliable and made the engineers more willing to trust the results. A flaky eval is worse than a slow eval; a slow eval is worse than no eval; no eval is worse than what we want.
What’s next
Three things on the eval roadmap.
Claim-level eval. Today’s claim-coverage metric is computed at the sentence level. A drafted sentence often contains multiple distinct claims (e.g., “We hold SOC 2 Type II and ISO 27001, audited annually by a Big Four firm” — three claims). Sentence-level eval gives partial credit incorrectly. Claim-level eval — extracting individual claims and verifying each — gives a stricter, more useful number. We have a research branch on it. It is slower and more expensive to run, which is why it isn’t in the daily harness yet.
Composite grounded-score. Right now we report a half-dozen metrics. Internally we want a single number that combines them, weighted toward the failure modes that hurt customers most (uncaught ungrounded queries, fabricated numerics, off-topic retrievals). The trick is making the composite fail loudly when any individual component goes off a cliff, rather than averaging the cliff out. We are looking at min-aggregation rather than weighted-mean.
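The difference between the two aggregations in one toy example (component names and weights are made up for the sketch):

type Components = { precisionAt5: number; claimCoverage: number; ungroundedTpRate: number };

function weightedMean(c: Components): number {
  return 0.3 * c.precisionAt5 + 0.4 * c.claimCoverage + 0.3 * c.ungroundedTpRate;
}

function minAggregate(c: Components): number {
  return Math.min(c.precisionAt5, c.claimCoverage, c.ungroundedTpRate);
}

// If precision@5 falls off a cliff from 0.72 to 0.30, the weighted mean only moves
// by about 0.13, while the min-aggregate drops straight to 0.30, so the cliff stays loud.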
Cross-customer transferability. The interesting research question is: when a metric improves on corpora 1–4, does it improve on corpus 5? We are starting to track that as a leave-one-out signal, because the production reality is that we onboard new customers and need our retriever to generalize, not to overfit the corpora we’ve already seen.
If you want to read the engineering side of grounded retrieval more broadly, the grounded-retrieval pillar piece is the canonical post. If you want to read about how the gold sets are constructed in more practical detail, testing retrieval gold sets is a deeper cut into the labeling workflow.
The short version of this post: we measure retrieval against a 3,000-pair gold set across five customer corpora; precision@5 sits in the low-to-mid 0.7s; recall@20 around 0.85–0.90; claim-coverage around 0.85; CI fails the build on regressions over a tight threshold. None of those numbers will go up just because we wrote a blog post about them. They go up when we change something — and we measure whether they did.
This post is by the PursuitAgent engineering team. Engineering posts are a shared byline rather than a single author; views reflect PursuitAgent’s position and are written by the engineers building the product.
Sources
- 1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
- 2. RAGAS — Automated Evaluation of Retrieval-Augmented Generation
- 3. ARES — An Automated Evaluation Framework for RAG Systems (arXiv:2311.09476)
- 4. MTEB — Massive Text Embedding Benchmark
- 5. BEIR — A Heterogeneous Benchmark for Zero-shot Evaluation
- 6. TruLens — Evaluations for LLM applications
See grounded retrieval in the product.
Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.