Our eval harness, on the command line
A walkthrough of the dev loop for retrieval changes — one command to baseline, one command to re-run, one to diff. The CLI ergonomics that keep us from tuning by feel.
The companion post to this one is How we evaluate retrieval quality on our own corpus — that’s the methodology piece. This post is shorter and more practical: how the eval harness actually feels to use when you sit down to tune retrieval on a Tuesday morning.
The premise: every developer touching the retrieval pipeline runs the same harness, with the same gold sets, in the same way. The whole point is to remove the variable of “which eval did you run, and on what.” If two engineers are tuning two parts of the stack and reporting numbers, those numbers should be comparable without translation.
A note on the numbers below. The metric values in the example diffs are illustrative of the shape a developer sees on their laptop. They come from our internal eval harness running against our own corpora and gold sets. They are directional and not a benchmark — we do not expect them to reproduce on external setups, and we will publish the harness and a public gold set when both are stable.
Here’s the loop.
The three commands
# 1. baseline — capture current main behavior
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=main-baseline
# 2. iterate — run the same eval after a change
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=after-rerank-tweak
# 3. diff — compare runs
pnpm eval:diff main-baseline after-rerank-tweak
That’s the dev loop. Three commands, in that order, run dozens of times a day during a tuning sprint.
The corpora are named (corpus-1 through corpus-5, plus the synthetic test corpus we use for unit testing). The gold sets are date-stamped (2025-06, etc.), so a re-run against --gold=2025-06 pins to the same labels even after the gold set has since been extended. The --label flag is free-text; the run is stored under that label in eval-runs/<corpus>/<label>.json, so you can come back next week and reference it by name.
What the run actually does
The harness is a single TypeScript entry point that boots the retrieval pipeline outside the HTTP server. It loads the corpus's index, loads the gold-set queries, runs each query through the pipeline (embed → ANN → rerank → threshold), captures the top-k results per query, and computes the metrics the companion post describes: precision@k, recall@20, MRR, nDCG, ungrounded TP rate, answerable FR rate, and claim-coverage.
// backend/src/eval/run.ts (sketch; helpers like loadCorpus, loadGoldSet,
// retrievalPipeline, computeMetrics, writeRun, and printSummary are the
// real module's imports, elided here)
const corpus = await loadCorpus(args.corpus); // boots the named corpus's index
const goldSet = await loadGoldSet(args.gold); // date-stamped queries + labels

const results: PerQueryResult[] = [];
for (const query of goldSet.queries) {
  // same pipeline the HTTP server uses: embed -> ANN -> rerank -> threshold
  const retrieved = await retrievalPipeline.run(query.text, {
    topK: 20,
    rerank: !args.noRerank, // --no-rerank skips the cross-encoder
  });
  results.push({
    queryId: query.id,
    retrieved,
    expected: goldSet.labelsFor(query.id),
  });
}

const metrics = computeMetrics(results); // precision@k, recall@20, MRR, nDCG, ...
await writeRun(args.label, { args, metrics, results }); // eval-runs/<corpus>/<label>.json
printSummary(metrics);
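For reference, the record writeRun persists has roughly this shape. The post names the metrics, the per-query fields, and the eval-runs/<corpus>/<label>.json path, but not the exact JSON keys, so the key names below are assumptions:

// Sketch of the persisted run record; key names are assumptions, the
// fields themselves are the ones described in the post.
interface PerQueryResult {
  queryId: string;
  retrieved: Array<{ blockId: string; score: number }>; // top-k hits, k = 20
  expected: Record<string, 0 | 1 | 2>; // gold label per block; the 0-2 scale is inferred from the per-query diff below
  latencyMs?: number; // retrieval-only latency, recorded per query
}

interface StoredRun {
  args: { corpus: string; gold: string; label: string; noRerank?: boolean };
  metrics: {
    precisionAtK: Record<number, number>; // k = 1, 3, 5, 10
    recallAt20: number;
    mrr: number;
    ndcgAt10: number;
    ungroundedTpRate: number;
    answerableFrRate: number;
    claimCoverage: number;
  };
  results: PerQueryResult[];
}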
A run on a single corpus completes in roughly three minutes on a recent dev laptop. The slow step is the cross-encoder reranker; everything else is cached or fast. The --no-rerank flag drops a run to under 30 seconds, which is the version we use during the tightest tuning loop where we just want to know if the embedding side moved.
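One plausible shape for that tight loop, with the flag on both sides so the diff isolates the embedding change (the labels here are just examples; --label is free-text):

# baseline and re-run both skip the cross-encoder, so the diff is embedding-only
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=embed-base --no-rerank
# ...change the embedding side...
pnpm eval:retrieval --corpus=corpus-3 --gold=2025-06 --label=embed-tweak --no-rerank
pnpm eval:diff embed-base embed-tweak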
What the diff looks like
The diff is the part that gets used most. It does two jobs.
Aggregate diff. A side-by-side of every metric, with deltas highlighted if the move exceeds a threshold (0.5 absolute points by default).
                    main-baseline   after-rerank-tweak   delta
precision@1         0.81            0.82                 +0.01
precision@3         0.76            0.77                 +0.01
precision@5         0.72            0.74                 +0.02 *
precision@10        0.61            0.63                 +0.02
recall@20           0.88            0.88                 +0.00
MRR                 0.79            0.80                 +0.01
nDCG@10             0.74            0.75                 +0.01
ungrounded TP rate  0.83            0.81                 -0.02 !
answerable FR rate  0.07            0.09                 +0.02 !
claim-coverage      0.86            0.87                 +0.01
The asterisks flag positive moves outside noise. The exclamation marks flag negative moves outside noise. A diff with a clean asterisk on precision@5 and a clean exclamation on ungrounded TP rate tells you immediately what tradeoff you just made — a small lift on top-5 precision at the cost of some refusal capability. That tradeoff might be the right call. It might not. The point is that the diff makes you see it instead of letting you ship a precision win that quietly broke refusals.
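The flagging itself is easy to sketch, with two caveats: the flags mark moves "outside noise", so the real harness presumably gates on a per-metric noise estimate as well (which would explain why +0.02 on precision@5 is starred above while the same delta on precision@10 is not), and the conversion of "0.5 absolute points" to 0.005 on the 0-1 scale is an assumption. Names here are illustrative:

// Minimal sketch of delta flagging; omits the noise gate described above.
type Direction = 'up-is-good' | 'down-is-good';

function flagDelta(
  delta: number,
  dir: Direction,
  thresholdPoints = 0.5, // "0.5 absolute points", assumed to mean 0.005 on a 0-1 metric
): '*' | '!' | '' {
  if (Math.abs(delta) < thresholdPoints / 100) return '';
  const improved = dir === 'up-is-good' ? delta > 0 : delta < 0;
  return improved ? '*' : '!';
}

// flagDelta(-0.02, 'up-is-good')   === '!'  (ungrounded TP rate: down is a regression)
// flagDelta(+0.02, 'down-is-good') === '!'  (answerable FR rate: up is a regression)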
Per-query diff. The lower-volume but more important output. Lists every query whose top-1 retrieved block changed between runs. For each, prints: the query text, the old block ID + score + label, the new block ID + score + label.
Q-1832 "what's our DDQ for [redacted internal product]?"
before: blk-9201 (score 0.74, label 1)
after: blk-9203 (score 0.79, label 2)
-> upgrade
Q-2104 "how does our incident response SLA differ for trial accounts?"
before: blk-7411 (score 0.71, label 2)
after: blk-6809 (score 0.72, label 0)
-> regression
Q-2299 "show me the answer to question 4.7.1.b in our last response"
before: blk-(none) (refused)
after: blk-3104 (score 0.63, label 0)
-> false-positive (was correctly refusing)
Four lines per query, and one of four verdicts: upgrade, regression, false-positive (the system used to correctly refuse and now over-confidently answers wrong), or its mirror, false-refusal (used to answer correctly, now refuses).
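The verdict classification is simple enough to sketch. The label semantics (0 = not relevant, higher tiers better) and the use of a missing top-1 to mean a refusal are inferred from the examples above; the names are illustrative, not the real module:

// Hypothetical sketch of the per-query verdict.
type TopHit = { blockId: string; score: number; label: 0 | 1 | 2 } | null; // null = refused

function verdict(before: TopHit, after: TopHit): string {
  if (after === null) {
    // answering -> refusing: a loss only if the old answer was actually relevant
    return before !== null && before.label > 0 ? 'false-refusal' : 'upgrade';
  }
  if (before === null) {
    // refusing -> answering: a win only if the new top-1 is actually relevant
    return after.label > 0 ? 'upgrade' : 'false-positive';
  }
  if (after.label > before.label) return 'upgrade';
  if (after.label < before.label) return 'regression';
  return 'lateral'; // block changed but the relevance tier did not; not one of the four named verdicts
}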
The per-query diff is where most of the actual learning happens. Aggregate metrics tell you that something moved. The per-query diff tells you why.
Pre-PR usage
A typical pull-request workflow on a retrieval change:
# branch from main
git checkout -b retrieve-bm25-fallback
# baseline against the corpus you're targeting
pnpm eval:retrieval --corpus=corpus-2 --gold=2025-06 --label=main-baseline
# do the work — write the BM25 fallback for identifier queries
$EDITOR backend/src/kb/retrieval/hybrid.ts
# rerun
pnpm eval:retrieval --corpus=corpus-2 --gold=2025-06 --label=bm25-fallback
# diff
pnpm eval:diff main-baseline bm25-fallback
If the diff looks good — improvements on the targeted query type, no regression elsewhere — the PR description includes the diff output as a fenced block. Reviewers see the same numbers the author saw. There is nothing to translate, nothing to take on faith.
If the diff is mixed — improvement on the targeted query type, small regression elsewhere — the PR description states the tradeoff explicitly. The discussion happens on the tradeoff, not on whether the change “feels right.”
The CI job runs the same harness on the held-out 600-query slice, with thresholds. A PR that passes the developer’s diff but fails CI is usually a sign that the developer ran the diff against a corpus that was easier than the held-out slice. The corpora differ; that’s the whole point of running across five.
What the harness does not do
A few honest gaps.
It does not test end-to-end UX. The harness measures retrieval and (via claim-coverage) drafted-sentence faithfulness against gold blocks. It does not click around the proposal-builder UI, check that the verify button works, or confirm that the SSE stream completes. Those are e2e tests, run separately in Playwright.
It does not measure latency or cost. Retrieval-only latency is recorded per-query in the run JSON, but the aggregates we put in the summary are quality metrics. Latency tuning has a separate harness; cost tracking is a different system entirely. Mixing them in one tool would be a worse tool.
It does not generate the gold set. Gold sets are built by the labeling workflow, with humans, with disagreement-resolution. The harness consumes them. We considered building synthetic-gold-set generation into the harness — RAGAS does this — and decided against it for our use case. Synthetic gold sets are useful for early development; they are not a substitute for labeled customer data when the corpus and queries are private.
The principle behind the ergonomics
Three commands, the same three every time, the same arguments every time. We treat the eval harness like a unit-test runner. You wouldn’t tolerate a test runner where each developer ran a different command with different arguments and reported numbers in incompatible formats. Retrieval evaluation should be the same.
The cost of that discipline is real. The harness has been rewritten three times and the diff format has been negotiated through a half-dozen PR discussions. The benefit is that an engineer working on retrieval today does not have to invent the eval methodology before they invent the change. They run the same three commands the engineer who shipped the last change ran, against the same gold sets, and the comparison is honest by default.
That’s what makes tuning a habit instead of a project. Tune by feel and you ship regressions you can’t see. Tune by harness and you ship changes you can argue for, with evidence other people on the team can re-derive in three minutes.
This post is by the PursuitAgent engineering team. Engineering posts are a shared byline rather than a single author; views reflect PursuitAgent’s position and are written by the engineers building the product.