Field notes

Grounded AI is not a feature, it's a refusal

Opinion. The thing that makes grounded AI different from regular AI is what the system refuses to do: answer when retrieval is empty. Here's what we will not ship, even when reviewers ask for it.

Bo Bergstrom · 6 min read · Opinion

This is an opinion piece. The opinion is that the entire category of “grounded AI proposal software” is, with a few honest exceptions, marketing language for systems that hallucinate slightly less than ungrounded ones.

The thing that actually makes a system grounded is not what it can do. It is what it refuses to do.

The feature framing is wrong

Most vendor pages describe grounded AI as a capability. The system retrieves from your knowledge base, then generates an answer that cites the retrieved passages. Citations appear next to the generated text. The user can click through and verify. The architecture is RAG. The marketing word is “grounded.”

This framing is incomplete in a way that matters. RAG architectures hallucinate. Stanford HAI’s audit of commercial legal RAG tools — Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law — found hallucination rates between 17% and 33% on a controlled benchmark, despite all three systems being marketed as grounded. Citations appeared. The cited sources sometimes did not support the claims. The generated text was confidently wrong, with confidently irrelevant footnotes attached.

If “grounded” is just a feature label, that result is not a contradiction. The feature shipped. The feature retrieves. The feature cites. What the feature does not do is refuse to answer when the retrieval is empty.

The refusal is the product

The version of grounded AI we are willing to ship is the one where the system refuses to generate when it cannot ground. Empty retrieval — meaning no passage in the KB scored above a threshold of relevance to the question — produces no answer. The UI says: we do not have content on this. Here are the closest matches we found, none of which we judged sufficient. A human can write the answer; we will not write it for you from training data.
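
A minimal sketch of that boundary, assuming a hypothetical `Passage`/`Answer`/`Refusal` shape and an illustrative 0.75 threshold (none of this is our actual code); the point is the return type, which forces the caller to handle a refusal rather than always receiving text:

```python
from dataclasses import dataclass

# Illustrative threshold; in practice it is tuned per corpus and retriever.
RELEVANCE_THRESHOLD = 0.75

@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # retriever's relevance score for the question

@dataclass
class Answer:
    text: str
    citations: list[Passage]  # every sentence must trace to one of these

@dataclass
class Refusal:
    message: str
    closest_matches: list[Passage]  # sub-threshold passages, shown for human judgment

def answer_or_refuse(question: str, retrieved: list[Passage]) -> Answer | Refusal:
    """Refuse to generate when no passage clears the relevance threshold."""
    grounded = [p for p in retrieved if p.score >= RELEVANCE_THRESHOLD]
    if not grounded:
        # Empty retrieval: no answer, and no fallback to the model's training data.
        return Refusal(
            message="We do not have content on this.",
            closest_matches=sorted(retrieved, key=lambda p: p.score, reverse=True),
        )
    # Stand-in for generation constrained to the grounded passages only.
    draft = " ".join(p.text for p in grounded)
    return Answer(text=draft, citations=grounded)

if __name__ == "__main__":
    weak = [Passage("kb/security.md", "We use TLS 1.3 for data in transit.", 0.41)]
    print(answer_or_refuse("Do you hold ISO 27001 certification?", weak))  # -> Refusal
```

The design choice worth copying is that no code path produces an answer without at least one above-threshold passage attached.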

This sounds simple. It is the hardest product decision we have made.

The pressure to soften the refusal is constant. Beta reviewers ask for it. They say: when retrieval is weak, give me the best the model can do, with a warning. They say: I will edit the output anyway. They say: the workflow is faster when the system always produces something, even something I throw away. Every one of these requests is reasonable. Every one of them, if granted, would turn the product into the same thing the rest of the category sells: a generator that occasionally cites.

We refuse. The reason is that the failure mode of “always answer, sometimes warn” is not a slightly degraded version of grounded answering. It is a different mode entirely. AutogenAI described this failure mode accurately: hallucinations in proposals appear as “invented case studies, incorrect compliance claims, or fabricated statistics” delivered with confidence, passing early review unnoticed when the team is under deadline pressure. The warning labels the user agreed to read at 9 AM are invisible at 11 PM the night before submission.

A system that sometimes generates from training data is not a system the proposal team can trust to be honest about its sources. The user has to know, every time, whether the answer in front of them came from their KB or from the model’s prior. That knowledge is not preserved by a warning ribbon. It is only preserved by the system declining to answer when it cannot ground.

What we will not ship

Concretely, here is what we will not ship even when asked:

A “fallback to general knowledge” mode. When retrieval is empty, the system does not silently switch to the model’s training data with a softer label. It returns no answer.

A “draft this for me anyway” button on empty retrieval. Several reviewers have asked for this. They want the option, in their own UI, to say “I know retrieval was empty, but please generate something so I can edit it.” The argument is that the human is making the choice with full information. The argument is correct on its own terms. We still will not ship the button — because in practice the button gets clicked at deadline, by people who have stopped reading warnings, and the resulting paragraphs make it into the response unedited. The product’s job is to make hallucinated paragraphs hard to ship by accident. A button labeled “hallucinate this for me” makes them easy to ship by accident.

Confidence scores that imply graduated truth. Some grounded-AI tools surface a confidence score next to each generated paragraph. The score is computed from retrieval relevance and model perplexity. The implication is that 90%-confident output is mostly true and 60%-confident output is somewhat true. This is statistically defensible and operationally false. The user reads the score, applies a threshold, and ships the over-threshold paragraphs without verification — which is exactly the workflow the grounding was meant to prevent. We surface citations, not confidence scores.

“Smart synthesis” that combines retrieved passages with model generalization. Many systems blend retrieved content with generated bridge sentences that summarize, generalize, or extrapolate. The bridge sentences are where the hallucination lives. We make the retrieved passages structurally distinct from any generated text and we limit generated text to citation-bearing reformulation of the retrieved content.
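
To make the last point concrete, here is a sketch of that structural separation, with a validator that rejects uncited generated sentences. The names (`RetrievedBlock`, `GeneratedSentence`, `validate_draft`) are hypothetical illustrations, not a description of our internals or any vendor's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievedBlock:
    doc_id: str
    text: str  # verbatim KB content, rendered visually distinct in the UI

@dataclass(frozen=True)
class GeneratedSentence:
    text: str
    cites: tuple[str, ...]  # doc_ids this sentence reformulates; must never be empty

def validate_draft(sentences: list[GeneratedSentence],
                   retrieved: list[RetrievedBlock]) -> list[str]:
    """Reject generated sentences that are not citation-bearing.

    An uncited sentence, or one citing a source that was never retrieved,
    is exactly where a bridge-sentence hallucination would live.
    """
    retrieved_ids = {b.doc_id for b in retrieved}
    errors = []
    for i, sentence in enumerate(sentences):
        if not sentence.cites:
            errors.append(f"sentence {i}: no citation, excluded from draft")
        elif not set(sentence.cites) <= retrieved_ids:
            errors.append(f"sentence {i}: cites a source that was not retrieved")
    return errors

if __name__ == "__main__":
    retrieved = [RetrievedBlock("kb/uptime.md", "Our SLA guarantees 99.9% uptime.")]
    draft = [
        GeneratedSentence("The SLA guarantees 99.9% uptime.", ("kb/uptime.md",)),
        GeneratedSentence("Customers report excellent reliability.", ()),  # bridge sentence
    ]
    print(validate_draft(draft, retrieved))  # flags the uncited bridge sentence
```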

What we will ship

We will ship a system that says “I don’t know” loudly and often. The first-draft hit rate on a fresh KB is lower than it would be if we relaxed the refusal. Users have to write more original content for the questions retrieval can’t answer. The compensation is that what the system does produce is verifiable end to end — every sentence traces to a passage the user can read.

The reviewers who ask us to soften the refusal are the same reviewers who, six weeks later, tell us the thing they trust about the product is that it doesn’t lie. The two facts are connected.

The category position

The argument I want to make on behalf of grounded AI as a category — and to other operators building in this space — is that the refusal is the differentiator. Anyone can build retrieval over a corpus and generate citations. The proof of grounding is not that citations appear. The proof is that the system declines to answer when it cannot honestly do so.

Most of the market does not refuse. The features look identical on a comparison matrix. The behavior on empty retrieval is where the difference shows up, and that behavior is invisible until you watch a real user run a real workflow.

We are willing to lose the comparison-matrix shootout against vendors whose grounded-AI feature ships more answers per query, more bridge sentences per answer, more confidence scores per paragraph. The thing we are selling is the refusal. The rest of the category is selling the appearance of grounding. Until proposal teams have learned to tell the two apart at evaluation, we will keep losing the demo and winning the production deployments.

The closing

Grounded AI is not a feature. It is a discipline that the system enforces against the user’s short-term pressure. A vendor that ships a “draft anyway” button has not built grounded AI — they have built an ungrounded AI with a stricter UI. The architecture is the same. The behavior at the boundary is what differs.

The next time a vendor demo lands in your inbox, ask the question that exposes the difference: what does your system do when retrieval returns nothing relevant? If the answer is “we still draft something, with a warning,” you are looking at the same hallucinating generator the rest of the market sells. If the answer is “we don’t draft, we tell the user the KB doesn’t cover it,” you are looking at something different. Buy the second kind.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
  2. AutogenAI — AI hallucination: how can proposal teams reduce risk