Grounded Retrieval 101, Part 1: what RAG is and why it still hallucinates
RAG in three sentences, then the hard part: why retrieval-augmented generation still produces fabricated answers, and what the academic and practitioner literature says about it. Part 1 of a four-part series.
This is the first post in a four-part series on grounded retrieval. Each part is short, technical, and focused on one mechanism. The series is written for proposal-team practitioners who are evaluating RAG-based products, and for engineers who are building them.
This post covers what retrieval-augmented generation actually is, and why retrieval alone does not stop hallucination.
RAG in three sentences
A retrieval-augmented generation system answers a question in two steps. First, it retrieves passages from a corpus that look relevant to the question, using either keyword matching or vector embeddings. Second, it passes those passages to a language model along with the question, and the model writes an answer that, in principle, draws from the retrieved passages.
That’s it. There is no third sentence. RAG is two function calls.
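In code, the two calls look something like the sketch below. This is an illustration, not any particular product's pipeline: the sentence-transformers embedder, the OpenAI chat model, and the prompt wording are all assumptions, and a production system would add chunking, ranking, and caching around them.

```python
# A minimal sketch of the two calls. Everything here is illustrative:
# the embedding model, the chat model, and the prompt are assumptions,
# not a specific product's implementation.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Call 1: return the k passages most similar to the query."""
    query_emb = embedder.encode(query, convert_to_tensor=True)
    corpus_emb = embedder.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top = scores.topk(min(k, len(corpus)))
    return [corpus[int(i)] for i in top.indices]


def generate(query: str, passages: list[str]) -> str:
    """Call 2: ask the model to answer from the retrieved passages."""
    prompt = (
        "Answer the question using only the passages below.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {query}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed; any chat model fits the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer(query: str, corpus: list[str]) -> str:
    return generate(query, retrieve(query, corpus))
```

Everything that matters in this series happens inside, or around, those two calls.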
The reason RAG matters in proposal software is that it bounds what a model can write. Instead of asking a generic model “what’s our security posture?” — and getting a fluent paragraph composed from the model’s training data, which is to say, from no specific company’s actual posture — you ask it the same question and pass it your KB content blocks about security. The model writes from the blocks. The blocks are yours. The output is, in principle, about your company.
In principle.
The principle and the practice diverge
Stanford HAI’s 2024 study, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, evaluated three commercial legal research products that all advertise grounded AI: Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law. All three are RAG systems running over curated, high-quality legal corpora. The study found hallucination rates between 17% and 33%, depending on the product and the question category.
These are not naive systems. They were built by serious teams with high-quality corpora and access to the same engineering resources we have. The hallucinations they produce happen despite retrieval, despite citations, despite domain-tuned prompting. The citations often look correct on inspection — a real case, a real paragraph from a real document — and the generated sentence still misrepresents what the cited source actually says.
That gap has a name in the literature: source attribution vs. claim support. The system attributes a claim to a source. Whether the source actually supports the claim is a separate question, and a typical RAG pipeline doesn’t ask it.
Why retrieval alone is insufficient
A long-running Hacker News thread titled “The issue of hallucinations won’t be solved with the RAG approach” walks through the mechanism in plainer terms. The argument, which mostly holds, goes like this:
Retrieval surfaces passages that are relevant to a question. Relevance is a similarity score between an embedding of the query and an embedding of the passage. Similarity is not the same as entailment. A passage about “data encryption practices” is relevant to a question about “encryption at rest” — it might be the most relevant passage in the corpus — and might still not contain a specific statement about encryption at rest. The relevance score doesn’t know this. It just knows the passage is about the right topic.
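Here is that gap as a toy example, assuming the same kind of sentence-transformers embedder as in the sketch above; the passages are invented for illustration.

```python
# A toy illustration of relevance-without-support. The embedder and the
# passages are assumptions made up for this example.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

query = "Is customer data encrypted at rest?"
passages = [
    # On-topic, but makes no claim about encryption at rest.
    "Our data encryption practices follow industry standards, and all "
    "traffic between services is encrypted in transit using TLS 1.2+.",
    # Off-topic.
    "Employees complete annual security awareness training.",
]

query_emb = embedder.encode(query, convert_to_tensor=True)
passage_emb = embedder.encode(passages, convert_to_tensor=True)
print(util.cos_sim(query_emb, passage_emb))
# Expect the first passage to score well above the second. The score says
# "this passage is about the right topic"; it says nothing about whether
# the passage entails an answer to the question.
```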
The generator then receives the relevant passage and is asked to answer the question. The generator is helpful. The generator wants to produce an answer. If the passage is on-topic but doesn’t contain the specific claim the question demands, the generator will produce the claim anyway, citing the passage. The cited passage is on-topic — so the citation looks fine to a casual reviewer — but the claim is fabricated, drawn from the model’s training data rather than from the retrieved passage.
This is the dominant failure mode of grounded AI. It is not a model-quality problem. A more capable model fabricates more fluently. The structural fix has to live above the model.
Why this matters for proposal work
AutogenAI wrote about this in the proposal context — hallucinations in proposals show up as invented case studies, incorrect compliance claims, and fabricated statistics, all delivered with confidence and all liable to slip past review when the team is on deadline. A proposal that fabricates a SOC 2 control statement, or invents an ARR number, or references a customer engagement that didn’t happen, is worse than no proposal. It exposes the seller to contractual risk and damages the relationship if the fabrication is discovered.
The risk profile here is high precisely because the failure mode is silent. A keyword search that returns the wrong block produces an obvious miss — the writer notices and fixes it. A drafted sentence that cites a plausible neighbor of the right block reads correctly. The reviewer sees a citation. The citation goes to a real, plausible passage. Unless the reviewer reads the passage and verifies the entailment manually, the failure ships.
What this series will cover
This part: why retrieval alone doesn’t stop hallucination.
Part 2 (next week): why citations don’t guarantee groundedness — the entailment gap, and why a separate verification step is the only reliable defense.
Part 3: the verification mechanism we run in production — what it is, what it costs, where it fails.
Part 4: how we measure all of this — the held-out evaluation set, the metrics, and the honest version of what the numbers say.
The short version
RAG is two function calls: retrieve, then generate. Both calls have well-known failure modes. The retrieve call surfaces relevant-but-not-supporting passages. The generate call writes a confident answer regardless. The combination is fluent, citation-bearing, and wrong between 17% and 33% of the time, even in well-built systems.
The fix is not better retrieval. The fix is a step that retrieval and generation don’t include by default — one that asks, for every drafted claim, whether the cited source actually entails the claim. Part 2 is about that step.