Field notes

How we curate the retrieval gold set

120 questions, three annotators, a disagreement-resolution protocol. The recipe behind the held-out set we evaluate every retrieval pipeline change against — and the parts we plan to open-source.

The PursuitAgent engineering team · 7 min read · Engineering

Every meaningful change to our retrieval pipeline runs through a held-out evaluation set before it ships. This post walks through how the gold set is curated. The recipe is straightforward; the discipline that makes it useful is the disagreement-resolution protocol underneath.

The current production gold set has 1,840 triples. We rotate a 120-question working slice each quarter to track regression candidates without exhausting the full set; the full set is run on major architecture changes only. Reasoning: running the full 1,840 against every pipeline change is expensive in verifier time, and a 120-question slice gives us 90% of the signal at 7% of the cost.
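For illustration, here is a minimal sketch of how a deterministic quarterly slice could be drawn. The function name and the quarter-keyed seed are assumptions for the sketch, not our production tooling.

import hashlib
import random

def quarterly_slice(triple_ids, quarter, size=120):
    # Seed the RNG with the quarter label so everyone running the eval in,
    # say, "2025-Q3" draws the same 120 questions, and the slice rotates
    # automatically when the quarter changes.
    seed = int(hashlib.sha256(quarter.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(triple_ids), k=size))

# slice_q3 = quarterly_slice(all_triple_ids, "2025-Q3")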

A note on the numbers. Counts (1,840 triples, the 120-question slice, the agreement rates below) describe the state of our internal gold set at the time of writing. Eval numbers produced against it are directional: they describe our system on our corpora, not a general benchmark. We will publish the public slice of this gold set (described at the end of the post) so external readers can run their own numbers; the internal numbers from the full set are not a benchmark.

What a triple looks like

Each gold-set entry is a triple: a question, an expected source block, and an expected answer, plus a metadata record.

question: |
  What encryption-at-rest mechanism does the platform use for
  customer-uploaded documents?
expected_block:
  block_id: blk_a87f...
  document_id: doc_security_arch_v3
  page_range: [12, 13]
expected_answer: |
  AES-256 encryption applied at the storage layer using AWS KMS
  customer-managed keys, with per-tenant key isolation.
metadata:
  domain: security_questionnaire
  difficulty: standard
  added_by: r-patel
  added_at: 2025-06-04
  reviewed_by: [a-chen, m-okoro]
  resolution: consensus

The question is the natural-language form an evaluator might pose. The expected_block is the source-of-truth block ID; precision@k is measured against this. The expected_answer is the entailed assertion the verifier should produce. The metadata captures who added the triple, who reviewed it, what domain it’s in, and how the disagreement (if any) was resolved.
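For concreteness, this is roughly what loading and sanity-checking an entry like the one above could look like. The loader and the strictness of the checks are illustrative, not part of our tooling; only the field names come from the example.

import yaml  # PyYAML

REQUIRED_FIELDS = {"question", "expected_block", "expected_answer", "metadata"}
REQUIRED_BLOCK_KEYS = {"block_id", "document_id", "page_range"}

def load_triple(raw_yaml):
    # Parse one gold-set entry and check the fields the eval depends on.
    entry = yaml.safe_load(raw_yaml)
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"missing top-level fields: {sorted(missing)}")
    block = entry["expected_block"]
    if not REQUIRED_BLOCK_KEYS <= block.keys():
        raise ValueError("expected_block needs block_id, document_id, page_range")
    if not block["block_id"]:
        raise ValueError("block_id is empty; precision@k is scored against it")
    return entry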

Where the questions come from

Three sources, deliberately balanced.

Public RFPs. State and federal procurement portals publish thousands of RFPs every quarter. We pull a sample, extract the questions, normalize them into the gold-set form, and tag the domain (federal IT, healthcare, security questionnaire, etc.). Public-RFP questions skew toward formal language and detailed requirement enumeration, which is one important slice of what real evaluators ask.

Redacted customer data, with permission. Some PursuitAgent customers grant us permission to use their submitted RFP questions (with the buyer-identifying language redacted) as gold-set inputs. This slice skews more toward the conversational variants: the kinds of questions that appear in less-formal DDQs and SaaS security questionnaires. Permission is logged per customer and revoked if that customer opts out.

Synthetic, human-verified. A third slice is generated by us: an engineer or a domain expert writes a question that targets a known KB block, and a second human verifies it against the canonical answer. Synthetic triples let us cover failure modes we want explicit coverage of (numeric-precision questions, compound claims across blocks, vocabulary translation between standard and customer-specific terms).

The three sources are roughly balanced 40/30/30. We watch the proportions and rebalance when one source’s coverage gets thin.
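A small sketch of the kind of drift check this implies, assuming each triple's metadata carries a source field (the example above omits it) and using a tolerance we picked purely for illustration.

from collections import Counter

TARGET = {"public_rfp": 0.40, "customer_permissioned": 0.30, "synthetic": 0.30}
TOLERANCE = 0.05  # flag a source once it drifts more than five points

def source_drift(triples):
    # Report which sources have drifted away from the 40/30/30 target mix.
    counts = Counter(t["metadata"]["source"] for t in triples)
    total = sum(counts.values())
    return {
        source: round(counts.get(source, 0) / total, 3)
        for source, target in TARGET.items()
        if abs(counts.get(source, 0) / total - target) > TOLERANCE
    }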

The annotation protocol

Every triple passes through three annotators (one author and two independent reviewers) before it enters the gold set. The protocol is short.

Step 1: an annotator writes the triple. The author proposes a question, identifies the expected source block, and writes the expected answer. The triple enters a pending-review state.

Step 2: two reviewers annotate independently. Each reviewer either confirms the triple or flags it. Confirmation means the question is realistic, the expected_block is the right source, and the expected_answer is correctly entailed by the block. A flag names which of the three is wrong: question unrealistic, wrong source block, or answer not entailed (or under- or overstated).

Step 3: resolution. If both reviewers confirm, the triple enters the gold set with metadata resolution: consensus. If one confirms and one flags, the author and the flagging reviewer discuss; the triple either gets revised and re-entered, or gets dropped. If both reviewers flag, the triple is dropped and the failure is logged in the gold-set quality tracker.
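The three steps reduce to a small piece of state logic. A sketch, with hypothetical type and state names:

from dataclasses import dataclass

@dataclass
class Review:
    reviewer: str
    confirmed: bool        # True = confirm, False = flag
    flag_reason: str = ""  # "question_unrealistic" | "wrong_block" | "not_entailed"

def resolve(reviews):
    # Map two independent reviews to an outcome:
    #   both confirm -> accepted with resolution: consensus
    #   split        -> author and flagging reviewer discuss; revise or drop
    #   both flag    -> dropped, logged in the gold-set quality tracker
    assert len(reviews) == 2, "protocol expects exactly two reviewers"
    confirms = sum(r.confirmed for r in reviews)
    if confirms == 2:
        return "accepted", "consensus"
    if confirms == 1:
        return "pending_discussion", "revise_or_drop"
    return "dropped", "logged_in_quality_tracker"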

We track the flag rate per author. New annotators get flagged at higher rates while they calibrate to the team's standards. A sustained high flag rate for one author usually indicates a systematic issue (overspecified expected answers, source-block confusion, etc.) and triggers a calibration session.
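The flag rate itself is just a ratio per author; a sketch, with an assumed shape for the review log:

from collections import defaultdict

def flag_rates(review_log):
    # review_log: iterable of (author, confirmed) pairs, one entry per
    # review of a triple that author wrote.
    totals, flags = defaultdict(int), defaultdict(int)
    for author, confirmed in review_log:
        totals[author] += 1
        if not confirmed:
            flags[author] += 1
    return {author: flags[author] / totals[author] for author in totals}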

Disagreement resolution

The most important part of the protocol is what happens when two reviewers disagree about whether an answer is entailed by a source block. This is the case that exposes the gap between “looks correct” and “is correct,” and it is where the gold set’s discipline lives.

Three categories of disagreement, with three different resolutions.

Numeric precision disputes. Reviewer A says “AES-256” entails “AES-256-GCM”; reviewer B says they are different claims. Resolution: defer to the source block’s actual phrasing. If the block says “AES-256-GCM,” the expected answer must say “AES-256-GCM” or a superset that includes the GCM mode; if the block says only “AES-256,” as in the example triple above, the expected answer says only “AES-256,” even when the live system is in fact more specific. The defensible posture is to be strict about what the block actually states and to accept that the expected answer may be lossier than reality. Numeric-precision disputes used to be the most common disagreement category; they have dropped to second place since we tightened the recipe.
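One way to mechanize "defer to the block's phrasing" is to check that any cipher-spec-style token the expected answer claims actually appears in the block text. A crude sketch, not our verifier:

import re

# Matches spec-like tokens such as AES-256 or AES-256-GCM.
SPEC_TOKEN = re.compile(r"\b[A-Z]{2,}[A-Z0-9]*(?:-[A-Z0-9]+)+\b")

def overstated_tokens(expected_answer, block_text):
    # Spec tokens the answer claims but the block never states.
    return set(SPEC_TOKEN.findall(expected_answer)) - set(SPEC_TOKEN.findall(block_text))

# overstated_tokens("AES-256-GCM at rest", "We encrypt at rest with AES-256.")
# -> {"AES-256-GCM"}: the answer is more specific than the block, so it gets flagged.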

Synonym and tense disputes. Reviewer A says “we encrypt at rest” entails “encryption is applied to data at rest”; reviewer B says they are different because the second is passive. Resolution: synonym-tense equivalence is accepted at the gold-set level. The verifier we ship may or may not handle the equivalence well in the live system, but the gold set should encode the semantic ground truth, not the verifier’s current state. If the verifier fails on tense shifts, that shows up as a verifier-quality regression in the eval, which is a separate bug.

Compound-claim disputes. Reviewer A says the answer “we encrypt at rest with AES-256-GCM and key rotation every 90 days” is fully entailed by the block. Reviewer B says the block describes encryption but not key rotation; key rotation is in a different section of the same document. Resolution: the answer is split into two atomic claims. The encryption claim stays with the cited block; the key-rotation claim becomes a separate triple with its own expected_block pointing to the section that describes key rotation. Compound-claim splitting is the most labor-intensive resolution, but it is the one that improves gold-set quality the most.
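To make the split concrete, a hypothetical before-and-after; the block IDs and wording are placeholders, not entries from our set.

# One compound triple ...
compound = {
    "question": "How is customer data protected at rest?",
    "expected_answer": "Encrypted at rest with AES-256-GCM, with key rotation every 90 days.",
    "expected_block": {"block_id": "blk_encryption"},
}

# ... becomes two atomic triples, each pinned to the block that entails it.
atomic = [
    {
        "question": "What encryption is applied to customer data at rest?",
        "expected_answer": "Encrypted at rest with AES-256-GCM.",
        "expected_block": {"block_id": "blk_encryption"},
    },
    {
        "question": "How often are encryption keys rotated?",
        "expected_answer": "Encryption keys are rotated every 90 days.",
        "expected_block": {"block_id": "blk_key_rotation"},
    },
]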

What we don’t include

Two things, deliberately.

Questions we generated using LLMs without human verification. Synthetic-LLM-only triples introduce a feedback loop where the gold set drifts toward what the LLM finds easy. We have run experiments on LLM-generated questions; they are useful for stress-testing the retrieval ranker on adversarial paraphrases, but they don’t enter the production gold set without human verification.

Questions whose expected_block is the most recently approved version. Gold-set triples pin to a specific block version, not “the current version.” When the block is updated, the triple either updates explicitly (we re-review) or gets retired. This prevents the gold set from silently shifting under a pipeline that did not change.
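A sketch of the retirement check this implies; the content_hash field and the shape of the KB snapshot are assumptions about how version pinning might be recorded, not our schema.

def triples_needing_review(triples, current_blocks):
    # current_blocks: dict mapping block_id -> content hash in the live KB.
    # Each triple is assumed to store the hash it was reviewed against, so
    # a block update surfaces here instead of silently shifting the gold set.
    stale = []
    for t in triples:
        block = t["expected_block"]
        if current_blocks.get(block["block_id"]) != block.get("content_hash"):
            stale.append(t)  # re-review and update explicitly, or retire
    return stale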

What we plan to open-source

A subset of the gold set — the synthetic-and-public-RFP slices, scrubbed of any customer-permissioned content — is going on GitHub later this quarter. Roughly 800 triples in the first release, with the schema and a CLI for running an external retrieval system against them. This will give the community a domain-specific eval set for proposal-and-RFP retrieval that the existing public datasets (RAGAS, ARES) don’t cover well.

The customer-permissioned slice stays internal. We do not have permission to publish it.

Why this is worth the effort

Without a gold set, “we improved retrieval” is an opinion. With one, it is a number. Every change we make to the retrieval pipeline produces a quantitative read against the gold set, and most of the time that read tells us either “this change moved precision@1 up by 1.4%, ship it” or “this change moved it down by 0.6%, don’t ship it.” In the roughly 5% of cases where the gold set fails to produce a clear signal, usually because the change addresses a failure mode the set doesn’t cover, we update the gold set to close the gap before re-evaluating.
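The read itself is a straightforward comparison; a sketch of precision@1 over the working slice, with the retrieval call left abstract:

def precision_at_1(triples, retrieve):
    # retrieve(question) -> ranked list of block_ids from the pipeline under test.
    hits = sum(
        1 for t in triples
        if retrieve(t["question"])[:1] == [t["expected_block"]["block_id"]]
    )
    return hits / len(triples)

# Ship decision: compare the candidate pipeline against the current baseline.
# delta = precision_at_1(working_slice, candidate) - precision_at_1(working_slice, baseline)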

The set is the product team’s most expensive artifact and its most defensible one. Every regression we caught before shipping was caught here; every regression we shipped was something the gold set didn’t cover yet. The next post in the build-log series, two weeks out, covers the second installment of the retrieval-evaluation work: numeric claims, a failure mode we’re explicitly expanding the gold set against.

Sources

  1. RAGAS — Reference-free evaluation framework for RAG
  2. ARES — Automated Evaluation Framework for RAG
  3. PursuitAgent — Retrieval evaluation pillar
  4. PursuitAgent — Testing retrieval gold sets
  5. PursuitAgent — Eval harness CLI
  6. PursuitAgent — Grounded retrieval pillar