Field notes

One year of grounded retrieval: what changed, what didn't

The engineering companion to the founder retrospective. A year of build-log posts, condensed: what the retrieval stack looks like now, how verification evolved, what the gold set became, and what's still unsolved.

The PursuitAgent engineering team · 18 min read · Engineering

A year ago this week, we published the Grounded-AI Pledge in code. That post named three invariants — pointer, provenance, entailment — and described the specific Hono routes and SQL tables that carried them. The thesis held up. Most of the code did not.

This is the engineering companion to Bo’s retrospective from January. His post was the thesis. This post is the code. We’ll walk the retrieval stack, the verification stack, and the evaluation pipeline as they stood then and as they stand now, and we’ll close on the problems we thought would be easy and weren’t, plus the ones we’re actually working on for year two.

One caveat up front. Nothing in the underlying discipline changed. The invariants we defined in the pillar post are still the contract. What changed is how we enforce them — the moving parts, the thresholds, the failure modes we can now name. A year ago we knew grounded retrieval was three invariants. Now we know which one breaks first under which conditions, and we have test coverage that proves it.

The Stanford HAI paper that set the floor for this work — 17% to 33% hallucination rates in production legal RAG systems — is still the reference we use when someone asks why we measure entailment and not similarity. That paper is now 18 months old. We have not seen commercial systems close that gap meaningfully. We have closed it for our own workload, on our own evaluation set, and we’ll tell you exactly what that sentence means by the end of this post.

The retrieval stack, one year on

A year ago, the retrieval stack was four stages. It's still four stages — the same four — but the implementation of each has been rebuilt, in one case twice.

Chunking

Then. Fixed 800-token chunks with 120-token overlap, split on paragraph boundaries when available, on sentence boundaries when not. One chunk per KB block. We wrote about the decision in how we chunk proposals and the ablation that led to 800 tokens in the chunk size ablation post.

Now. The 800-token chunk is still the default, but chunks are no longer fixed-length. We switched to a semantic splitter that respects block boundaries as hard stops, uses sliding 800-token windows inside long blocks, and emits a separate chunk per detected section header. The per-block chunk count went from a mean of 1.0 to a mean of 1.4 — meaning most blocks are still one chunk, but long blocks get split along their own internal structure.

What we did not change: the 800-token target. We re-ran the ablation in November with our current embedding model and the same eval set, and 800 still won against 500, 1200, and 1600 by a small but consistent margin on recall@10 for the discriminator-language questions. The ablation artifact is in the chunk size post; the re-run notebook is linked at the bottom of that page. If we ever change chunk size it will be because our blocks got substantially longer, not because a new model prefers a different window.

What we removed: the fixed overlap. 120-token overlap mattered when chunks were arbitrary; it doesn’t matter when chunks end at block boundaries or detected section headers. We kept a 40-token overlap for sliding windows inside long blocks, and zero overlap at block boundaries. That change dropped our embedding count per corpus by 11% with no measurable retrieval quality change on the gold set.
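For readers who want the policy in code, here is a minimal sketch of the splitting rule described above — block boundaries as hard stops, 800-token sliding windows with 40-token overlap inside long blocks. The names and the whitespace token counter are illustrative, not our production splitter (which also detects section headers).

```typescript
// Sketch of the chunking policy: one chunk per short block, sliding
// 800-token windows with 40-token overlap inside long blocks.
interface Block { id: string; text: string; }
interface Chunk { blockId: string; text: string; }

const TARGET = 800;   // token target per chunk
const OVERLAP = 40;   // overlap only inside long blocks, zero at block boundaries

// Stand-in for a real tokenizer; whitespace split is a rough proxy.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

function chunkBlock(block: Block): Chunk[] {
  const tokens = block.text.split(/\s+/).filter(Boolean);
  if (tokens.length <= TARGET) {
    // Most blocks: one chunk, ending at the block boundary.
    return [{ blockId: block.id, text: block.text }];
  }
  // Long blocks: sliding 800-token windows with 40-token overlap.
  const chunks: Chunk[] = [];
  for (let start = 0; start < tokens.length; start += TARGET - OVERLAP) {
    chunks.push({ blockId: block.id, text: tokens.slice(start, start + TARGET).join(" ") });
    if (start + TARGET >= tokens.length) break;
  }
  return chunks;
}
```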

Embedding model

Then. We shipped with Gemini Embedding 2 at 3072 dimensions. We wrote about the selection in embedding model selection. The decision hinged on recall@10 on our internal proposal corpus against four then-available alternatives — Gemini Embedding 2, OpenAI text-embedding-3-large, Cohere embed-v4, and Voyage.

Now. Same model. We re-ran the selection twice in the past year, in May and in October. Neither re-run produced a candidate that beat the incumbent on our eval set by enough margin to justify a migration. The closest challenger was a domain-tuned open-weight model that won on narrow-domain security questionnaire content but lost on general proposal prose. For a system that handles both, the incumbent stays.

We did cut dimensionality. Running the same model with the truncation parameter set to 1024 dimensions — Matryoshka-style — lost us a rounding-error amount of recall and cut our pgvector index size by 66%. We migrated in two phases: new blocks at 1024, then a background re-embed of the back catalog over three weeks. The migration is notable mostly for being boring. No production incidents. Cost per embedding dropped 50%. Query latency dropped a measurable but unexciting 8ms on the hot path.
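If you haven't used Matryoshka-style truncation before, the general shape is just "keep the leading dimensions, then re-normalize." A sketch, assuming the embedding model was trained for truncation (as Matryoshka-style models are); the function name is ours, not an API:

```typescript
// Matryoshka-style truncation: keep the first `dims` dimensions and
// re-normalize so cosine similarity still behaves sensibly.
function truncateEmbedding(full: number[], dims = 1024): number[] {
  const head = full.slice(0, dims);
  const norm = Math.sqrt(head.reduce((acc, x) => acc + x * x, 0)) || 1;
  return head.map((x) => x / norm);
}
```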

The lesson: model churn is not the right default. An embedding model that’s stable for a year means that every other component downstream can be tuned against a fixed substrate. Every time you change the embedding model, you invalidate every cached retrieval, you re-test every query that used to work, and you pay re-indexing cost on your entire corpus. Those costs are real and they are not recovered by the new model being 2% better on a benchmark you didn’t choose.

Reranker

Then. No reranker. First-stage retrieval was pgvector nearest-neighbors on the embedding index, top-50 returned, trimmed to top-10 by a post-filter on recency and document type.

Now. A cross-encoder reranker on top of first-stage. We wrote the full build-out in the reranker that paid for itself. Short version: we retrieve 50, rerank to 10, and the rerank call adds 90–140ms of latency and about $0.0003 per query. Against that cost, we measured an 18-point jump in answer-faithfulness on our gold set. It paid for itself inside the first week of shipping.

What we’d do differently if we started today: we’d ship the reranker from day one, not as a year-two add. The reason we didn’t at the time was that we were still catching hallucinations at the drafting stage via entailment and we weren’t sure the reranker would move the needle enough to justify the engineering. It did. A reranker tightens the candidate set that goes into the generator’s context, and everything downstream gets cheaper and more correct.

The reranker is a BGE cross-encoder. We looked at Cohere’s hosted reranker and at a couple of open-weight alternatives; BGE won on latency at the precision we needed. We aren’t precious about this choice — a reranker is substitutable in a way the embedding model isn’t.
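The retrieve-50, rerank-to-10 flow is small enough to sketch. The cross-encoder is abstracted behind a hypothetical scoreRelevance function — substitute whatever hosted or self-hosted reranker you run; this is the shape, not our production code.

```typescript
// Retrieve 50 candidates from first-stage retrieval, score each with a
// cross-encoder, keep the top 10. scoreRelevance is a stand-in for a real
// cross-encoder call (e.g. a BGE reranker behind an HTTP endpoint).
interface Candidate { chunkId: string; text: string; }

async function rerank(
  query: string,
  candidates: Candidate[],                       // top-50 from first stage
  scoreRelevance: (q: string, passage: string) => Promise<number>,
  keep = 10,
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, score: await scoreRelevance(query, c.text) })),
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, keep).map((s) => s.c);
}
```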

First-stage retrieval

Then. Dense-only. Vector nearest-neighbors, no BM25.

Now. Hybrid, with dense and sparse combined via reciprocal rank fusion. Details in hybrid search: dense + sparse. Short version: BM25 recovers exact-match queries that dense retrieval drops — acronyms, product SKUs, FAR clause numbers, certification IDs. Dense recovers paraphrase and conceptual queries that BM25 misses. RRF fusion gets us both and is trivial to tune.

The one-line code change — adding postgres_ts_vector to our Drizzle schema and writing a tsvector index — turned out to matter a lot for a specific class of proposal queries we hadn’t foreseen: “does our KB mention clause 52.227-14?” type questions. Those are lookup queries dressed up as retrieval queries, and dense retrieval loses to a keyword index on them every time.

The failure mode worth calling out: if you weight sparse too heavily in the RRF, you over-fit to rare terms and lose on conceptual questions. We settled on k=60 in the standard RRF formula and equal weights. Anyone starting fresh should copy that and tune later.
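The fusion itself is a few lines. A sketch with k=60 and equal weights, as above; input is two ranked lists of chunk IDs, one from dense retrieval and one from BM25:

```typescript
// Reciprocal rank fusion over a dense and a sparse ranked list.
// score(id) = sum over lists of 1 / (k + rank_in_list), with k = 60.
function rrfFuse(dense: string[], sparse: string[], k = 60, topN = 50): string[] {
  const scores = new Map<string, number>();
  for (const list of [dense, sparse]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1)); // rank is 1-based
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id);
}
```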

The verification stack, one year on

A year ago, verification meant two things: an entailment check at draft time, and a citation renderer on the output. Today it means four things. The two original checks are still there, two more got added, and one piece of UI got removed.

Entailment, still the load-bearing check

Then. For every generated sentence that made a factual claim, we ran a natural-language-inference check against the retrieved source span: does the source entail this sentence? If no, refuse the sentence and ask the generator to try again or abstain.

Now. Same shape, different implementation. A year ago this ran on a fine-tuned DeBERTa-v3 MNLI classifier. Today it runs through a small LLM (Gemini 3 Flash at the time of writing) with a narrow structured-output prompt that returns {entails: bool, reason: string}. The LLM is slower per call — 180ms versus 40ms — but it’s right more often on the cases the classifier failed on, especially compound claims and numerical reasoning.

We paid for the latency with a cache. Entailment results are cached by (sentence_hash, source_span_hash). Cache hit rate on the steady-state drafting flow is about 71%, which means most entailment checks are free. For a fresh RFP the cache is cold and the full check runs; by the time the same KB evidence has been cited in three bids, it’s hot.
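A sketch of the cached check. The LLM call and hashing helper are stand-ins; the parts that matter are the (sentence_hash, source_span_hash) cache key and the {entails, reason} structured output.

```typescript
import { createHash } from "node:crypto";

// Structured output we ask the entailment LLM for.
interface EntailmentResult { entails: boolean; reason: string; }

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// checkWithLlm is a stand-in for the real structured-output LLM call.
async function checkEntailment(
  sentence: string,
  sourceSpan: string,
  cache: Map<string, EntailmentResult>,          // keyed by (sentence_hash, source_span_hash)
  checkWithLlm: (sentence: string, span: string) => Promise<EntailmentResult>,
): Promise<EntailmentResult> {
  const key = `${sha256(sentence)}:${sha256(sourceSpan)}`;
  const cached = cache.get(key);
  if (cached) return cached;                     // ~71% of steady-state checks land here
  const result = await checkWithLlm(sentence, sourceSpan);
  cache.set(key, result);
  return result;
}
```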

Citation rendering

Then. Citation numbers after each sentence, linking to the source block in a side panel. Click the citation, see the block.

Now. Same, plus a hover state that shows the specific span within the block. We wrote about the UI work in citation rendering stack. The substantive change: the span is stored, not derived. A year ago we computed the highlight on the fly from a similarity match between the emitted sentence and the source text; that match was wrong about 8% of the time on long blocks. Now the span is persisted as (block_id, block_version, start_offset, end_offset) at the moment of generation. No post-hoc matching.
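The persisted span, roughly, as a Drizzle-style table — a sketch with illustrative names, not our actual schema:

```typescript
import { pgTable, text, integer } from "drizzle-orm/pg-core";

// Span persisted at generation time; nothing is recomputed post-hoc.
export const citationSpans = pgTable("citation_spans", {
  sentenceId: text("sentence_id").notNull(),
  blockId: text("block_id").notNull(),
  blockVersion: integer("block_version").notNull(), // pins the span to the block version cited
  startOffset: integer("start_offset").notNull(),   // character offsets into the block text
  endOffset: integer("end_offset").notNull(),
});
```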

Claim-level verification

New. We wrote about this in the claim-level verification pass. A year ago we verified sentences. Sentences are too coarse — one sentence can carry two claims and one of them can be unsupported while the whole sentence reads as “cited.”

A claim is the smallest predicate-argument unit in a sentence. “We support SOC 2 Type II since 2021 and our SOC 2 Type I report was retired in 2020” is one sentence and four claims: SOC 2 Type II is supported, support started in 2021, a SOC 2 Type I report previously existed, and it was retired in 2020. Each claim gets its own entailment check against its own source span. If any claim fails, the sentence gets flagged and the generator is asked to rewrite it into separable, independently-citable sentences.

The extraction step — sentence to claims — runs on a small LLM with a structured-output prompt. We tried a rule-based claim extractor first. It worked on 60% of sentences and failed in ways that were hard to debug. The LLM extractor works on about 94% of sentences and fails in ways that are obvious to a reviewer.
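A sketch of the extraction contract. The prompt and model call are stand-ins (extractWithLlm is hypothetical); the point is the structured output — an array of claims, each checked independently downstream.

```typescript
// Shape of the claim extractor's output. Any small structured-output-capable
// LLM works; the call itself is abstracted behind extractWithLlm.
interface Claim {
  text: string;            // the atomic predicate-argument unit
  isNumeric: boolean;      // routes the claim to the exact-match numeric check
}

async function extractClaims(
  sentence: string,
  extractWithLlm: (prompt: string) => Promise<Claim[]>,
): Promise<Claim[]> {
  const prompt =
    `Split the sentence into atomic claims (smallest predicate-argument units). ` +
    `Return JSON: [{ "text": string, "isNumeric": boolean }].\nSentence: ${sentence}`;
  const claims = await extractWithLlm(prompt);
  // Treat extraction as advisory: on degenerate output, fall back to the whole sentence.
  return claims.length > 0 ? claims : [{ text: sentence, isNumeric: false }];
}
```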

Numeric-fact verification

New. A specialized check that fires when the entailment LLM flags a claim as numeric. The check parses the number out of the generated sentence, parses the number out of the source span, and does an exact match. No fuzzy matching, no unit conversion, no rounding tolerance.

This is a hack that replaced a much more ambitious plan. Our original design involved a symbolic numeric-reasoning layer that could handle “we reduced latency by 30%” against a source that said “latency went from 400ms to 280ms.” The symbolic layer existed in prototype and worked about 70% of the time. We deleted it. The exact-match check is right 99.8% of the time on numbers it can see and refuses cleanly on numbers it can’t. That’s a better contract than 70% plausible-match.
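The check itself fits in a dozen lines: parse the numbers out of both sides and require every number in the claim to appear verbatim in the source. A sketch — the normalization rules here (strip thousands separators, keep decimals) are illustrative:

```typescript
// Exact-match numeric check: every number in the claim must appear in the
// source span. No unit conversion, no rounding tolerance, no percent math.
function extractNumbers(text: string): string[] {
  return (text.match(/\d[\d,]*(?:\.\d+)?/g) ?? []).map((n) => n.replace(/,/g, ""));
}

function numbersSupported(claim: string, sourceSpan: string): boolean {
  const claimNums = extractNumbers(claim);
  const sourceNums = new Set(extractNumbers(sourceSpan));
  if (claimNums.length === 0) return true;              // nothing numeric to verify
  return claimNums.every((n) => sourceNums.has(n));     // refuse on any miss
}

// "30% faster" against a source saying "400ms → 280ms" fails the check
// and the sentence is refused — by design.
```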

The operator consequence: sometimes a sentence gets refused because the number in the source is “400ms → 280ms” and the generator wrote “30% faster.” The reviewer picks one wording and re-drafts. This is fine. “Fine” here means less invisible-wrong-answer risk than the alternative.

What we removed

The attribution heatmap UI we shipped in month three and pulled in month eight. We wrote about it in the pillar post. Reviewers read the heatmap as a proof of grounding when it was actually a similarity display. Pulling it forced us to put the verification signal on the refusal path instead of the UI: if the system can’t ground a sentence, it refuses, and the operator sees the refusal. If it can ground, the citation is there. No middle state.

The evaluation pipeline, one year on

The eval pipeline is where the compounding shows up.

The gold set, a year ago

Day 26 of last year: we had 80 gold questions. We wrote about the initial curation in how we curate the golden set. The questions were hand-authored by two proposal veterans against three anonymized corpora — a healthcare IT shop, a federal services vendor, and a mid-market security vendor. Each question had one or more “must-cite” block IDs, and a reference answer that any correct response had to entail.

80 questions let us do directional measurement: this change helped, this change hurt. 80 questions is not enough to measure small deltas and is not enough to catch regressions on rare question types.

The gold set, now

1,240 questions across 11 corpora. Still hand-authored, but now with structured annotation: question type (factual lookup, compound reasoning, discriminator claim, numeric, temporal), must-cite blocks, must-not-cite blocks, and a reference answer graded by a second annotator.

The corpus count matters more than the question count. Adding a new corpus — a state-government vendor, say — surfaces failure modes that don’t exist in the mid-market SaaS corpus. Federal acquisition language has its own grammar. Healthcare compliance has its own cross-reference patterns. Education procurement has its own evaluation rubrics embedded in the RFP itself. The gold set gets richer by adding domains, not by adding questions inside the same domain.

We wrote about the ongoing curation cadence in retrieval evaluation part 2. Short version: every shipped corpus gets 40 new gold questions in the first two weeks. Every production incident that surfaces a retrieval bug becomes a gold question before the fix ships. The set grows at roughly 20–30 questions a week at steady state.

Directional metrics, still directional

We still report four core metrics on every eval run. Nothing fancy, nothing that a public RAG eval framework doesn’t already cover.

  • Recall@10 on the must-cite blocks — does the retriever surface the right sources?
  • Answer faithfulness — is every sentence in the generated answer entailed by a cited source?
  • Answer completeness — does the generated answer cover every fact that the reference answer covers?
  • Abstention precision — when the system says “I can’t answer from the KB,” is that actually the right call?

Public frameworks — RAGAS, ARES, TruLens — each implement versions of these. We use our own implementations because our entailment LLM is a moving part we want to control, but anyone starting from zero should run one of the three frameworks as a baseline before building their own.
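As a concrete example of the first metric: recall@10 on must-cite blocks is just the fraction of a question's must-cite block IDs that appear behind the top-10 retrieved chunks. A sketch:

```typescript
// Recall@10 on must-cite blocks: what fraction of the question's must-cite
// block IDs show up among the blocks behind the top-10 retrieved chunks?
function recallAt10(retrievedBlockIds: string[], mustCite: string[]): number {
  const topBlocks = new Set(retrievedBlockIds.slice(0, 10));
  const hits = mustCite.filter((id) => topBlocks.has(id)).length;
  return mustCite.length === 0 ? 1 : hits / mustCite.length;
}
```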

What we report to ourselves and don’t report externally: per-question-type breakdowns. Recall@10 on compound claims is lower than recall@10 on factual lookup. Answer faithfulness on numeric claims is higher (because of the exact-match numeric check) than on qualitative claims. Reporting one aggregate number hides which question types are regressing.

Regression guards in CI

Then. We ran the eval set manually, roughly weekly, and eyeballed the output.

Now. Every merge to main runs a 200-question subset of the gold set inside CI. The subset is stratified across question types. The job reports four metric deltas against the last main commit. If any metric drops by more than a threshold (2 points on recall@10, 3 on faithfulness, 4 on completeness, 1 on abstention precision), CI fails.

The thresholds are intentionally generous in one direction and tight in the other. Answer faithfulness is the one we defend hardest — a regression there is a grounding failure, which is the thing we are not allowed to ship. Completeness and recall are allowed to wobble a little; if a change improves faithfulness at the cost of some completeness, that’s usually a good trade.
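The gate itself is simple. A sketch with the thresholds above; the metric names and report shape are illustrative, not the actual CI job.

```typescript
// CI regression gate: fail the build if any metric drops by more than its
// allowed wobble relative to the last main commit. Thresholds are in points.
type Metrics = { recallAt10: number; faithfulness: number; completeness: number; abstentionPrecision: number };

const MAX_DROP: Metrics = { recallAt10: 2, faithfulness: 3, completeness: 4, abstentionPrecision: 1 };

function ciGate(baseline: Metrics, current: Metrics): { pass: boolean; failures: string[] } {
  const failures = (Object.keys(MAX_DROP) as (keyof Metrics)[])
    .filter((m) => baseline[m] - current[m] > MAX_DROP[m])
    .map((m) => `${m} dropped ${(baseline[m] - current[m]).toFixed(1)} pts (allowed ${MAX_DROP[m]})`);
  return { pass: failures.length === 0, failures };
}
```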

The full 1,240-question set runs nightly on main. The nightly job produces a report that gets pasted into the engineering channel. Slow metrics we care about but don’t block on — median cost per answer, P95 draft latency — land in that nightly report as well.

Open problems, one year later

What we thought would be easy and wasn’t:

Compound claim decomposition. We thought sentence-to-claims would be a trivial LLM pass. It's not. The extractor is right on about 94% of sentences, which means the claims from 6 sentences in 100 are either missed, split wrong, or hallucinated into existence. Those wrong claims produce verification results that are meaningless, and they do it silently. We’ve moved to a best-effort model: the extractor’s output is treated as advisory, and the ground truth is still “does the sentence as a whole survive entailment against the cited sources?” Which is exactly where we started, with an extra signal bolted on.

Synonym failures in retrieval. A buyer asks “what’s your business continuity plan?” The KB has the answer under “disaster recovery.” Dense retrieval mostly finds it. Sparse retrieval doesn’t. Compound queries (“BCP and DR posture”) sometimes find both and sometimes find neither. We thought this was a solved problem with a good embedding model; it isn’t. The current workaround is a synonym dictionary maintained by the KB owner. We want this to be learned. It isn’t yet.

What we thought would be hard and was straightforward:

Cost control. A year ago we spent real engineering time on cost projections for grounded retrieval at scale. Embedding costs, LLM costs, reranker costs. They all turned out to be fine. At steady-state the cache dominates; at spiky loads the per-request cost is a tenth of what we modeled. We wrote about this in cost per response breakdown. The reason we got this wrong is that we modeled against list prices and against worst-case cache misses. Reality: caches work, batch discounts are real, and embedding costs got cheaper twice during the year.

Multi-provider fallback. We built fallback logic a year ago that routes around a single-provider outage. In practice we’ve used it twice. The code is simple, the test case is simple, the provider APIs are stable enough that a well-formed request to one is easily retargeted to another. The complexity we were afraid of didn’t materialize.

What’s still genuinely hard:

Numeric precision without losing recall. The exact-match numeric check is high precision and low recall: it refuses some claims that are semantically fine but not textually identical. We accept the refusal-happy behavior because it’s safer than the alternative. A learned numeric-reasoning layer that could handle “latency went from 400ms to 280ms” and “30% faster” as equivalent would be a genuine improvement. We have prototypes. None of them are good enough yet.

The “I don’t know” boundary. When is it right to abstain? A reviewer doesn’t want a system that refuses to answer any question that requires synthesis across two blocks. A reviewer also doesn’t want a system that will cheerfully synthesize across two blocks it shouldn’t have combined. The threshold between these is workload-dependent, and we don’t yet have a principled way to set it per-customer. For now it’s a tuning knob with a sensible default.

What’s in the research branch

A few things we’re working on for year two, ordered by how likely we are to ship them.

Multi-block entailment. Today entailment is checked against a single source span. Some proposal claims are legitimately supported only by combining two blocks. A naive multi-block entailment check is trivial to implement and trivial to abuse — the system could claim support by cherry-picking two blocks that, read together, seem to back a claim that neither alone does. We’re working on a constrained version: multi-block support is allowed only when the claim’s argument structure requires it, and the constraint is checked by an LLM pass. Early results are encouraging, shipping is not imminent.

Learned rerankers. The BGE cross-encoder is trained on a generic dataset. A domain-tuned reranker trained on our gold set’s positive and negative pairs should outperform it on our workload. We’ve trained one. It does. It beats the generic reranker by about 4 points on recall@10 on the gold set, and it does worse on one specific out-of-distribution corpus. We’re not shipping until we understand the out-of-distribution behavior; the risk of a reranker that’s good on workloads it’s seen and worse on ones it hasn’t is exactly the regression behavior we don’t want.

Adaptive chunk sizes. Today every corpus gets 800-token chunks. Short-block corpora (security questionnaires) would benefit from 300-token chunks; long-prose corpora (executive summary libraries) might benefit from 1200. The research question is whether the chunk size can be learned from the corpus rather than set per-corpus by a human. We have a prototype that works on two of three test corpora. It fails on the third in a way we don’t yet understand.

A per-tenant eval slice. The gold set today is one pooled set. A customer with an unusual corpus — say, a civic procurement vendor working primarily with state-level RFPs — sees different failure modes than the pooled set suggests. The plan is to build a thin “customer gold set” framework that lets a customer annotate 40 questions against their own KB and gets metrics back specific to their workload. Engineering work is scoped; it’s fighting for prioritization against the items above.

Closing

A year ago we shipped a Pledge and a retrieval stack and argued that grounded retrieval was a contract, not a feature. Everything about that still holds. The code has changed; the discipline hasn’t.

If you’re starting this work from zero, the three posts to read in order are the Pledge in code, the grounded retrieval pillar, and the retrieval evaluation pillar. Everything else is a footnote on those three.

What we’d tell our year-ago selves: ship the reranker on day one, skip the attribution heatmap, and start the gold set bigger than you think you need. Everything else we’d do again.

We’ll write the year-two retrospective in February 2027. The pile of things we don’t yet know how to measure is bigger than it was a year ago, which is probably the correct direction.

Sources

  1. PursuitAgent — Grounded-AI Pledge in code
  2. PursuitAgent — Grounded retrieval: what it is, what it isn't, what we measure
  3. PursuitAgent — Retrieval evaluation: the pillar
  4. PursuitAgent — Chunk size ablation
  5. PursuitAgent — The reranker that paid for itself
  6. PursuitAgent — Hybrid search: dense + sparse
  7. PursuitAgent — Claim-level verification pass
  8. PursuitAgent — How we curate the golden set
  9. PursuitAgent — KB schema evolution, year one
  10. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)
  11. RAGAS — Reference-free evaluation framework for RAG
  12. ARES — Automated Evaluation Framework for RAG
  13. TruLens — RAG evaluation and observability
  14. Google — Gemini 3

See grounded retrieval in the product.

Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.