Detecting ungrounded spans in drafts, line by line
A per-sentence classifier that flags which spans in a drafted RFP answer lack source coverage in the retrieved context. What it costs, what it catches, and what it still misses.
A drafted RFP answer is a paragraph of prose. Some of its sentences are supported by the retrieved context. Some aren’t. The interesting engineering question is which sentences are which, and how to tell at write time rather than at review time.
This post is about the per-sentence classifier that runs on every drafted answer in the product. It flags spans that don’t have source coverage — meaning no sentence in the retrieved context entails the claim in the draft. The classifier is small, cheap, and catches a useful fraction of the fabrication pattern that Stanford HAI’s research on commercial legal RAG documented in 17 to 33% of outputs. It also misses things. This post is about both.
The setup
Every drafted answer goes through this pipeline:
- The drafting model produces a candidate paragraph from the retrieved context.
- The paragraph is split into sentences.
- Each sentence is paired against the retrieved context chunks that fed the draft.
- The classifier returns a grounded/ungrounded label per sentence with a confidence score.
- The UI renders the paragraph with ungrounded spans underlined and offers the reviewer a one-click “replace with sourced claim” or “remove.”
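A minimal sketch of that flow, assuming Python. The names here — `Span`, `split_into_sentences`, `check_draft` — are illustrative stand-ins, not the product’s internals; `classify_sentence` is sketched in the next section.

```python
import re
from dataclasses import dataclass

@dataclass
class Span:
    sentence: str
    grounded: bool
    confidence: float

def split_into_sentences(text: str) -> list[str]:
    # Naive splitter for illustration; anything sturdier (spaCy, pysbd) would do.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def check_draft(draft: str, context_chunks: list[str]) -> list[Span]:
    """Label every sentence of a drafted answer against the retrieved context."""
    context = "\n".join(context_chunks)  # the same chunks that fed the draft
    spans = []
    for sentence in split_into_sentences(draft):
        label, confidence = classify_sentence(sentence, context)  # see next section
        spans.append(Span(sentence, grounded=(label == "entailed"), confidence=confidence))
    return spans  # the UI underlines every span with grounded=False
```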
The goal is not to prevent ungrounded sentences from appearing in the draft. We’d need to retrain the drafting model for that, and we’d still lose recall on genuinely supported claims that the retriever missed. The goal is to name the ungrounded sentences cheaply enough that the reviewer sees them before the bid goes out.
The classifier
The classifier is a fine-tuned cross-encoder based on a publicly available NLI-style base model. Input is a pair: (sentence from draft, concatenated retrieved context). Output is a three-way label: entailed, neutral, or contradicted. We treat both “neutral” and “contradicted” as ungrounded; the reviewer sees each as an underlined span in the draft.
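To make the input/output contract concrete, here is a hedged sketch using the Hugging Face transformers API. The model name is a placeholder (the post doesn’t name the base model) and the label order is an assumption you would check against the model’s config; treat this as the shape of the call, not the shipped code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "your-org/nli-cross-encoder"  # placeholder, not the actual base model
LABELS = ["entailed", "neutral", "contradicted"]  # assumed order; check the model config

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def classify_sentence(sentence: str, context: str) -> tuple[str, float]:
    """Score one draft sentence against the concatenated retrieved context."""
    # Premise = retrieved context, hypothesis = drafted sentence. Long contexts
    # would need chunked scoring; plain truncation here is a simplification.
    inputs = tokenizer(context, sentence, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])

def is_ungrounded(label: str) -> bool:
    # Both "neutral" and "contradicted" surface to the reviewer as underlined spans.
    return label != "entailed"
```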
Fine-tuning dataset: 18,000 labeled pairs we generated across 2025. Two-thirds are real drafted sentences from our own pipeline, labeled by a reviewer who had both the drafted sentence and the retrieved context in front of them. One-third are synthetic hard negatives — sentences we generated to sound plausible but have no support in the retrieved context, to force the classifier to learn “surface plausibility is not entailment.”
Inference cost per sentence: roughly 1.5 milliseconds on our co-located CPU inference path. A typical RFP answer is 8 to 14 sentences. Per-answer classifier cost: 15 to 25 milliseconds. This is cheap enough that we run it on every draft, not only on drafts that the drafting model flagged as low-confidence.
What it catches
Three patterns, in rough order of frequency in production:
Unsupported quantification. The draft says “our platform has processed over 40 million API requests across customer deployments.” The retrieved context mentions the product and its API. It does not contain “40 million.” The classifier flags the sentence as neutral. The reviewer sees it underlined. About 35% of our flagged spans fall here. This is the pattern AutogenAI named as “fabricated statistics” and the one that does the most damage when it escapes review.
Invented customer or case reference. The draft says “a Fortune 500 healthcare customer saw a 40% reduction in onboarding time.” The retrieved context mentions healthcare capabilities in the abstract. It does not name a customer. Classifier flag rate here is lower, maybe 12% of spans, because the drafting model is now prompted against inventing named customers. When it does happen, it’s usually a vague reference (“a leading healthcare provider”) that the classifier catches because the retrieved context doesn’t contain the anchor term.
Compliance overclaim. The draft asserts a certification the retrieved context doesn’t support. “We are SOC 2 Type II certified for all regions” when the retrieved block says “SOC 2 Type II for US and EU.” This is the pattern that gets proposals disqualified. The classifier catches the full overclaim about 80% of the time because the contradicting context is usually retrieved alongside. It misses the 20% where the correct context wasn’t retrieved at all — which is a retrieval problem masquerading as a verification problem.
What it misses
Partial entailment. The draft says “we encrypt data at rest and in transit with AES-256 and TLS 1.3.” The retrieved context supports “AES-256 at rest” and “TLS in transit,” but not “TLS 1.3” specifically. The classifier labels the sentence as entailed because most of it is. The subtle overclaim — the specific version number — slips through. We are exploring a sub-clause classifier that operates on sentence fragments, not full sentences. Not shipped.
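To show roughly what that sub-clause direction could look like: decompose each sentence into fragments and score each fragment separately, so a supported clause can’t carry an unsupported one. Nothing below has shipped; `split_into_fragments` and `classify_fragments` are illustrative, and the conjunction-based split is a deliberately naive stand-in.

```python
import re

def split_into_fragments(sentence: str) -> list[str]:
    # Deliberately naive: split on commas and a few coordinating words. A real
    # version would decompose claims with a dependency parse or an LLM pass.
    parts = re.split(r",\s*|\s+and\s+|\s+with\s+", sentence)
    return [p.strip() for p in parts if p.strip()]

def classify_fragments(sentence: str, context: str) -> list[tuple[str, str]]:
    # Reuses classify_sentence() from the classifier sketch above.
    return [(frag, classify_sentence(frag, context)[0])
            for frag in split_into_fragments(sentence)]

# The hope, for the example above, is that "TLS 1.3" lands in its own fragment
# and gets flagged as neutral even though the rest of the sentence is entailed.
```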
Correct claims against bad retrieval. A sentence can be factually correct about the company and still flagged as ungrounded because the retrieved context happened not to include the source block. The classifier only sees the draft and the retrieved context, not the rest of the KB. The reviewer then spends time adjudicating a true claim. We ship a “search the KB for support” button next to each flagged span to make this cheap, but it’s still friction.
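One plausible shape for that button, sketched under assumptions: `kb_search` stands in for whatever full-KB retrieval the product exposes, and re-running the classifier against the wider results is my guess at the mechanics, not a description of the shipped feature.

```python
def recheck_against_kb(flagged_sentence: str, kb_search) -> tuple[str, float]:
    # kb_search(query, top_k) -> list[str] is hypothetical. Retrieve from the
    # whole KB, not just the chunks that fed the draft, then reclassify.
    wider_context = "\n".join(kb_search(flagged_sentence, top_k=5))
    return classify_sentence(flagged_sentence, wider_context)
```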
Tone-derived implicature. The draft says “our platform is trusted by the largest enterprises in financial services.” The retrieved context mentions customers in financial services, including at least one identifiable top bank. The classifier labels the sentence as entailed. “Trusted by the largest” is a tone-derived claim that most reviewers would want to temper, but it isn’t technically contradicted by the context. This is the class of problem a classifier can’t solve — it’s a style problem, not a grounding problem.
The human-in-the-loop contract
The classifier changes the reviewer’s job. Before we shipped it, a reviewer scanned every sentence for plausibility. Now the reviewer scans the flagged sentences first — they are the highest-probability problems — and audits the unflagged sentences at lower rigor. The reviewer’s time goes to the right places.
The contract we make with the reviewer is explicit: the classifier is a triage tool, not a certification. An unflagged sentence is not certified as correct; it is labeled “the classifier didn’t find a problem.” A flagged sentence is not certified as wrong; it is labeled “the classifier found a pattern worth checking.” This distinction is mandatory in the UI. We don’t let the language drift into “approved” or “verified,” which would let reviewers trust the tool past its actual competence.
The answer-provenance graph we shipped in Q3 is the other half of this. The classifier tells the reviewer whether a sentence is supported. The provenance graph tells the reviewer by what — which retrieved block, which KB document, which version.
The honest measurement
On our adversarial eval set, the classifier’s precision on ungrounded-span detection is 0.84; recall is 0.72. Both numbers will move as we expand the training set. Neither number is good enough to remove the human. Neither number is bad enough to skip the tool.
The question I keep getting: “why not raise recall at the cost of precision — flag more things, let the reviewer triage?” We tried. At precision 0.70, the reviewer starts ignoring the underlines because the false-positive rate is high enough that the underlines lose signal. The UI has to carry meaningful flags, and 0.84 precision is the floor at which reviewers in our usage telemetry still act on the flag. Below that, the product fails the only test that matters: does the reviewer trust it enough to use it?
What’s next
Three directions:
- Sub-sentence classification. The partial-entailment miss is the one that costs reviewers the most trust when it leaks through. This is the highest-value unsolved piece.
- Retrieval-aware flagging. When a sentence is flagged as ungrounded, we know whether the retriever pulled the right neighborhood. If it didn’t, we should label the flag differently — “this might be correct but retrieval missed the source” is different from “this looks fabricated.”
- Reviewer-action telemetry. We track which flagged spans reviewers actually edit versus which they accept as-is. That signal trains the next version of the classifier and also tells us where the false-positive rate is hurting us most.
The grounded-AI problem is not solved by a single model. It’s solved by a stack of small checks that each catch a specific failure and that are honest about what they miss. The classifier is one of those checks. It is not the whole stack.