Field notes

The prompt test suite, an update

300 tests across our drafting and verification prompts. What they cover, what they miss, which ones still flake, and how we keep the flaky ones from becoming the reason we stop running CI.

The PursuitAgent engineering team · 5 min read · Engineering

We have 300 prompt tests. About 270 pass deterministically. About 25 pass stochastically with a retry budget. About 5 flake in ways we haven’t fully characterized. This post is about what the suite covers, what it doesn’t, and how we’ve kept the flaky end from poisoning the whole.

What the suite is for

Prompt tests are not the same as gold-set evaluation. The gold set measures end-to-end behavior on realistic questions. Prompt tests measure specific prompt behavior on controlled inputs: given this system prompt and this user message, does the output match an expected schema, contain or not contain certain strings, and score above a threshold on a specific check?
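To make that concrete, here’s a rough sketch of what a single test case might look like; the field names and values are illustrative, not the harness’s actual API:

interface PromptTestCase {
  family: string;                // one of the six prompt families below
  systemPrompt: string;          // the prompt under test
  userMessage: string;           // controlled input, not a realistic question
  expectSchema?: object;         // JSON schema the output must satisfy
  mustContain?: string[];        // strings that must appear in the output
  mustNotContain?: string[];     // strings that must not appear
  minScore?: number;             // threshold on a specific scored check
}

const example: PromptTestCase = {
  family: "drafting",
  systemPrompt: "You draft proposal sections from retrieved context only...",
  userMessage: "Draft the executive summary from the context blocks below: ...",
  mustNotContain: ["cannot answer"],  // grounding is present, so no refusal
  minScore: 0.8,
};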

Gold-set eval catches regressions in the overall retrieval-plus-generation pipeline. Prompt tests catch regressions in a specific prompt’s behavior when the prompt is edited.

Both run in CI. The gold set runs nightly on main, plus a 200-question subset on every merge. Prompt tests run on every pull request, in about 90 seconds.

What the 300 tests cover

Split across six prompt families:

  • Drafting (120 tests). The prompts that take retrieved context and produce a proposal section. Tests cover: does the output stay within the cited sources, does it refuse when grounding is missing, does it produce the expected structural format (executive summary vs. security-question answer vs. discriminator-claim block).
  • Claim extraction (60 tests). The prompt that turns a sentence into its constituent claims. Tests cover: does it split compound claims, does it preserve numeric values exactly, does it mark qualitative vs. quantitative claims.
  • Entailment check (50 tests). The prompt that judges whether a source entails a sentence. Tests cover: clear entailments (true positives), clear non-entailments (true negatives), tricky cases with hedging language, and numeric-precision cases.
  • Abstention (30 tests). The prompt that decides when to refuse. Tests cover: when the retrieved context is empty, when it’s thin, when it’s off-topic, when a single block could technically answer but with low confidence.
  • Win-theme extraction (20 tests). The newer suite from the win-loss engineering work. Tests cover: theme identification, deduplication, false-positive filtering.
  • Structural formatting (20 tests). Checks that the generator respects output schemas — required JSON fields, bullet vs. prose, word-count caps.
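The formatting checks are the most mechanical of the six. A minimal sketch of one such check, assuming the generator returns JSON; the required fields and the word-count cap here are illustrative:

function checkStructure(raw: string): string[] {
  const errors: string[] = [];
  let parsed: Record<string, unknown>;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  // Required fields for a discriminator-claim block (assumed schema).
  for (const field of ["claim", "evidence", "citations"]) {
    if (!(field in parsed)) errors.push(`missing required field: ${field}`);
  }
  // Word-count cap on the claim text (cap value is illustrative).
  const claim = typeof parsed.claim === "string" ? parsed.claim : "";
  if (claim.split(/\s+/).filter(Boolean).length > 60) {
    errors.push("claim exceeds the word-count cap");
  }
  return errors;
}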

Which tests flake and why

Five tests flake, and it’s the same five most of the time:

  1. A compound-claim decomposition test where one of the claims is structurally ambiguous. The correct split depends on reader interpretation, and the LLM doesn’t always pick the same split twice. We’ve kept the test because both splits are defensible: it’s really checking that the output contains all the semantic content, not that the split is identical.

  2. An entailment test on a hedged source (“the system generally supports…”) against a confident claim (“the system supports…”). The LLM correctly flags this as non-entailment about 80% of the time; the other 20%, it judges entailment, which is defensible but not what we want. We retry up to three times and take the majority verdict (see the sketch after this list).

  3. A structural formatting test for a bullet list that sometimes comes back as a numbered list. The generator knows the difference, mostly. We loosened the test to accept either and moved on.

  4. A win-theme extraction test with two near-duplicate themes. The LLM sometimes merges them, sometimes keeps both. Either is fine; the test checks that no third theme appears, and that passes deterministically.

  5. A drafting test where the retrieved context is borderline-sufficient. The refusal decision could reasonably go either way. We retry with a tightened refusal threshold for this specific test.
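For test 2, the retry-and-majority pattern is simple enough to sketch. Assume callPrompt is a stand-in for the harness’s model call and that the entailment prompt returns a boolean verdict:

async function majorityEntailment(
  callPrompt: () => Promise<boolean>,
  runs = 3,
): Promise<boolean> {
  const verdicts: boolean[] = [];
  for (let i = 0; i < runs; i++) {
    verdicts.push(await callPrompt());   // each run is an independent sample
  }
  const entailed = verdicts.filter(Boolean).length;
  return entailed > verdicts.length / 2; // majority wins; three runs can't tie
}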

The pattern across all five: the flake is a real ambiguity in the input, not a flaw in the prompt. We’ve learned to treat “this test flakes” as a signal to inspect the test, not to mute it.

What the suite misses

Three things it doesn’t catch:

Emergent behavior from prompt edits we didn’t think to test. If we edit the drafting prompt to add a new instruction, the suite checks the behaviors it already knew about. It doesn’t proactively discover new behaviors the new instruction might have introduced. That gap gets caught by the nightly gold-set run, not by the prompt tests.

Cross-prompt regressions. A change to the claim-extraction prompt can degrade the entailment prompt’s behavior downstream. Each prompt is tested in isolation. End-to-end behavior is the gold set’s job; we don’t duplicate it here.

Model-version drift. A prompt that passed last week on claude-sonnet-4-6-20260201 can behave differently on claude-sonnet-4-6-20260215. We pin model versions in CI and update them on a schedule with a re-run. Between updates, the suite is blind to upstream changes.
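The pin itself is a one-liner. Something along these lines, with an illustrative constant name and the model ID mentioned above:

// Every harness call goes through this constant, so the scheduled update is
// a one-line diff that triggers a full re-run of the prompt suite.
export const PINNED_MODEL = "claude-sonnet-4-6-20260201";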

What we tell ourselves about flakes

One rule: a test that flakes without a named reason is a broken test. Every flake gets either characterized (“this input is genuinely ambiguous, here’s the reasoning”) or fixed. Muting a flake without a reason leads to a suite nobody trusts. A distrusted suite gets ignored. An ignored suite isn’t a suite.

The five flaky tests listed above each have a file-local comment explaining the ambiguity. New engineers reading the suite should be able to tell, from the comment alone, whether a new flake is expected or not.
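The comments read roughly like this (wording invented for illustration, not copied from the suite):

// FLAKY (characterized): the source hedges ("generally supports") while the
// claim is unqualified. Non-entailment is the preferred verdict; an
// entailment verdict is defensible. Handled with 3 retries + majority vote.
// A flake that survives the retry budget is NOT expected: investigate,
// don't mute.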

Running the suite

The harness is an extension of our eval harness CLI. Same invocation pattern:

pnpm eval prompt --family drafting --verbose

90 seconds for the full suite in CI. Locally, it’s faster if the LLM caches are warm. The test runner output links to the prompt diff and the test input for any failing test, so debugging is a click away from the failure line.

Next on the suite: adding a cross-prompt integration layer that catches the drafting-plus-entailment regressions the gold set currently has to catch. More on that when it ships.

Sources

  1. PursuitAgent — The eval harness CLI
  2. PursuitAgent — How we curate the golden set
  3. PursuitAgent — One year of grounded retrieval