Confidence scores for grounded drafts, explained
What '82% confident' means in our drafting engine, how it's computed from retrieval and entailment signals, and where it leads the reviewer.
When the drafting engine returns a sentence with “82% confident” next to it, the reviewer reasonably asks: confident about what? This post is the honest answer.
What the number is, and isn’t
The confidence score on a drafted sentence is the probability that the sentence’s substantive claims are entailed by at least one block in the retrieval set. It is not:
- The model’s perplexity over the sentence.
- The retrieval similarity score.
- The model’s self-reported confidence (we ignore that signal entirely; the Stanford HAI legal-RAG paper showed across multiple commercial systems that self-reported confidence is barely correlated with truth).
- A guarantee.
The number is a calibrated probability output by an entailment classifier. “82% confident” is meant to read as: in the population of sentences this engine has scored at 82%, roughly 82% of them were judged supported by the retrieval set when sampled and reviewed.
How it’s computed
Each drafted sentence runs through three signals, combined into the displayed score:
Signal 1 — entailment classifier (weight 0.6). A DeBERTa-v3-large model fine-tuned on NLI plus our own labeled set takes (sentence, top-k retrieved blocks) and returns a probability of entailment. This is the dominant signal.
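For readers who want to see the shape of this signal in code, here is a minimal sketch. It stands in a public NLI checkpoint (cross-encoder/nli-deberta-v3-large) for our fine-tuned model, and the max-over-blocks aggregation is an illustrative assumption, not our production code:

```python
# Sketch of Signal 1. The checkpoint and the max-over-blocks aggregation are
# illustrative stand-ins, not the engine's actual fine-tuned model or code.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cross-encoder/nli-deberta-v3-large"  # public NLI model as a stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

# Find the entailment class from the config instead of hard-coding an index.
ENTAIL_IDX = next(i for i, name in model.config.id2label.items()
                  if name.lower() == "entailment")

def entailment_prob(sentence: str, blocks: list[str]) -> float:
    """P(sentence is entailed by at least one retrieved block), taken as the max over blocks."""
    best = 0.0
    for block in blocks:
        inputs = tokenizer(block, sentence, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        best = max(best, probs[0, ENTAIL_IDX].item())
    return best
```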
Signal 2 — retrieval gap (weight 0.25). The reranker produces scores for each retrieved block. We compute the gap between the top block and the fifth block — a wide gap means retrieval is decisive; a narrow gap means the top block isn’t notably better than the fifth and the retrieval is ambiguous. Sentences supported by an ambiguous retrieval get a lower confidence even when entailment is high.
Signal 3 — claim density (weight 0.15). A sentence with one factual claim is easier to ground than a sentence with four. We extract claim count via a small claim-extractor model. Confidence drops with claim density when entailment is computed over the whole sentence, because partial entailment hides one bad claim among three good ones. (We’re working on per-claim entailment, mentioned in the retrieval-eval pillar — when it ships, this signal goes away.)
Combined:
confidence = 0.6  * entailment_prob
           + 0.25 * normalize(retrieval_gap)
           + 0.15 * (1 / max(1, claim_count))
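To make normalize() and the claim-count term concrete, here is a sketch of the raw (pre-calibration) combination. The min-max bounds on the gap, the fallback when fewer than five blocks come back, and the helper names are assumptions for illustration:

```python
# Sketch of the raw (pre-calibration) score. The normalization bounds and
# helper names are illustrative assumptions, not the engine's actual config.
def normalize(gap: float, lo: float = 0.0, hi: float = 5.0) -> float:
    """Squash the reranker-score gap (top block minus fifth block) into [0, 1]."""
    return min(1.0, max(0.0, (gap - lo) / (hi - lo)))

def raw_confidence(entailment_prob: float,
                   reranker_scores: list[float],   # descending, one per retrieved block
                   claim_count: int) -> float:
    # Signal 2: a decisive retrieval has a wide gap between the top and fifth block.
    # With fewer than five blocks we treat the retrieval as ambiguous (assumption).
    gap = reranker_scores[0] - reranker_scores[4] if len(reranker_scores) >= 5 else 0.0
    # Signal 3: more claims in one sentence means partial entailment can hide a bad one.
    density_term = 1.0 / max(1, claim_count)
    return (0.6 * entailment_prob
            + 0.25 * normalize(gap)
            + 0.15 * density_term)
```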
The output gets calibrated by Platt scaling against a held-out reviewer-judged set, recomputed quarterly.
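Platt scaling is nothing exotic: it is a one-feature logistic regression fit on the held-out reviewer-judged set. A sketch with scikit-learn, with placeholder data standing in for our calibration set:

```python
# Sketch of the quarterly Platt-scaling step. raw_scores are uncalibrated engine
# outputs; labels are reviewer judgments (1 = supported). Placeholder data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([0.91, 0.42, 0.77, 0.63, 0.88, 0.30])   # placeholder values
labels     = np.array([1,    0,    1,    1,    1,    0])      # reviewer judgments

platt = LogisticRegression()
platt.fit(raw_scores.reshape(-1, 1), labels)

def calibrated(score: float) -> float:
    """Map a raw combined score to the displayed, calibrated probability."""
    return float(platt.predict_proba([[score]])[0, 1])
```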
What the calibration looked like in June 2025
The calibration plot for our most recent release:
predicted band   actual support rate   n
──────────────   ───────────────────   ─────
0.95–1.00        0.97                  412
0.85–0.95        0.88                  631
0.70–0.85        0.74                  1,109
0.50–0.70        0.55                  684
0.30–0.50        0.36                  287
< 0.30           0.18                  142
Calibration is not perfect. The 0.95–1.00 band is essentially on target (the band midpoint is about 97.5%, the actual rate is 97%). The 0.50–0.70 band is overconfident by ~5 points (the midpoint says 60%, the actual rate is 55%). We retune the Platt scaler every quarter to keep the gap below 5 points across all bands. This is roughly the discipline RAGAS recommends for calibrated faithfulness scoring, applied to our own labeled set rather than judge-LLM outputs.
A note on what “actual” means in that table: actual is the rate at which a human reviewer, given the sentence and the retrieved blocks, agreed that at least one block supported the sentence’s claims. Two reviewers, disagreement adjudicated by a third. The cost of building the calibration set is real — about 60 hours of reviewer time per quarter — and it is non-negotiable. A confidence score without calibration is a graph with no axis labels.
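The table itself is plain bucketed counting. A sketch of how the per-band actual rates fall out of the labeled set; the band edges match the table, and the sample format and field names are assumptions:

```python
# Sketch of the calibration table: bucket predicted scores and take the mean
# reviewer judgment per bucket. Band edges match the table; field names assumed.
from collections import defaultdict

BANDS = [(0.95, 1.01), (0.85, 0.95), (0.70, 0.85),
         (0.50, 0.70), (0.30, 0.50), (0.00, 0.30)]

def calibration_table(samples):
    """samples: iterable of (predicted_confidence, reviewer_supported) pairs."""
    buckets = defaultdict(list)
    for score, supported in samples:
        for lo, hi in BANDS:
            if lo <= score < hi:
                buckets[(lo, hi)].append(1 if supported else 0)
                break
    for lo, hi in BANDS:
        judged = buckets[(lo, hi)]
        if judged:
            rate = sum(judged) / len(judged)
            print(f"{lo:.2f}–{min(hi, 1.0):.2f}  actual={rate:.2f}  n={len(judged)}")
```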
Three thresholds, three behaviors
The confidence score drives three reviewer-facing behaviors:
Above 0.85 — citation badge, green. The sentence ships with a citation link to the top-supporting block. The reviewer can click through to verify; most don’t. The audit log records the score and the cited block.
0.50–0.85 — citation badge, yellow, plus a “review” affordance. The sentence ships, but the badge is amber and there is a click-to-inspect affordance that opens a side panel showing the top-3 retrieved blocks side by side with the sentence. The reviewer is being asked to confirm.
Below 0.50 — refusal. The sentence is replaced with a “no confident source” placeholder and a click-to-retry that lets the reviewer rewrite the prompt or expand the retrieval set. We do not ship low-confidence sentences with a “yellow” warning, because below 0.50 the calibration shows the sentence is more likely wrong than right.
The thresholds are not magic constants. They were picked by looking at the per-band support rates above and choosing thresholds where the false-confidence rate was tolerable. We re-validate the thresholds every quarter.
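For concreteness, here is the threshold logic as a sketch. The threshold values match the ones above; the config names and the return shape are illustrative, not the real engine schema:

```python
# Sketch of the reviewer-facing behavior per threshold. Values match the post;
# the config keys and action names are illustrative, not the real schema.
REFUSAL_THRESHOLD = 0.50   # below this, the sentence is replaced with a placeholder
GREEN_THRESHOLD   = 0.85   # at or above this, the citation badge is green

def reviewer_behavior(confidence: float) -> dict:
    if confidence >= GREEN_THRESHOLD:
        return {"ship": True, "badge": "green", "inspect_panel": False}
    if confidence >= REFUSAL_THRESHOLD:
        # Amber badge plus the click-to-inspect side panel with the top-3 blocks.
        return {"ship": True, "badge": "yellow", "inspect_panel": True}
    # Refusal: "no confident source" placeholder plus click-to-retry.
    return {"ship": False, "badge": None, "inspect_panel": False}
```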
Where this connects to the budget
The hallucination budget post describes a per-claim error rate we are willing to ship with. The confidence threshold for refusal is the operational lever that controls that rate. Lower the threshold and more sentences ship — including more wrong ones. Raise it and the system refuses more often, which costs draft completeness. The choice is a trade-off, not an optimization.
Currently we sit at 0.50 for refusal. The implied false-confidence rate is around 5–8% on shipped low-band sentences (the 0.50–0.70 band, which is the weakest band that still ships). That is the operating point, written into the engine config, reviewed at every release.
What the number cannot tell you
Three failure modes the confidence score will not catch:
- Stale source. The retrieved block was correct two years ago and is wrong now. Entailment passes; the truth is gone. This is a content-freshness problem, not a confidence problem. We surface staleness via the freshness-alerts feature.
- Off-topic retrieval that happens to be self-consistent. The top block is about a different product; the sentence paraphrases it; entailment scores high because the sentence is faithful to the (wrong) block. This is the failure mode the gold-set-driven precision@5 metric catches in development but cannot catch at runtime. The mitigation is keeping precision@5 as the headline metric and accepting that runtime cannot replicate the gold-set check.
- Confidently wrong numerics. “We have 47 SOC 2 audits in our history.” Entailment scores high because a block contains the number 47 in a different context. Numeric claims are the failure mode our retrieval is weakest on; the retrieval-eval pillar names this honestly. We are working on numeric-claim isolation, not yet shipped.
The score is useful, calibrated, and incomplete. We display it because reviewers asked for it and because it does most of the work it’s meant to do. We do not let it stand alone — the citation link, the inspect-retrieval affordance, and the audit log are the safety net underneath the number.
The takeaway
A confidence number that isn’t calibrated is decoration. A calibrated number with three reviewer-facing behaviors and a quarterly recalibration is a tool. The discipline isn’t the math; it’s the labeled set behind the math, and the willingness to refuse below threshold instead of shipping a yellow warning the reviewer ignores.