Field notes

Grounded Retrieval 101, Part 3: the citation rendering stack

From a verified retrieval hit to an inline citation a reviewer can hover and trust. Four components: citation marker, hover card, source viewer, and audit log.

Part 1 of this series covered the retrieval step. Part 2 covered the verification step. Part 3 is the rendering step — what happens between a verified, entailed sentence and the inline citation that a reviewer in the proposal UI actually sees, hovers, clicks, and trusts.

The rendering layer is often dismissed as “just UI.” It is not just UI. The rendering layer is where the discipline of grounded retrieval becomes a thing the reviewer can verify with their own eyes. A verified retrieval hit that renders as an unverifiable footnote-blob is, from the reviewer’s vantage, indistinguishable from an ungrounded answer with a fake footnote. The rendering has to expose the verification.

Four components. We’ll walk through each.

Component 1 — the citation marker

The citation marker is the inline reference that follows a generated sentence in the proposal draft. It looks like a small numbered superscript — [1], [2], [3] — at the end of the sentence. A reviewer reading the draft sees the marker and registers, without having to do anything, that the sentence is sourced.

The marker is interactive. Hovering surfaces the hover card (component 2). Clicking opens the source viewer (component 3). The marker is also visually consistent: every cited sentence gets one, no cited sentence is missing one, and uncited sentences are visually distinct (we render them with a faint dashed underline that we’ll discuss below).
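A rough sketch of what the marker carries and wires up, in TypeScript (the names here are illustrative, not our production component API):

    // Illustrative shape of a citation marker. Field names are assumptions.
    interface CitationMarker {
      citationId: string;   // key into the citation store and audit log
      sentenceId: string;   // the generated sentence this marker follows
      number: number;       // per-section marker number (see the numbering rule below)
    }

    // Hover surfaces the hover card (component 2); click opens the source viewer
    // (component 3). The handlers are passed in rather than hard-coded here.
    function attachMarkerHandlers(
      el: HTMLElement,
      marker: CitationMarker,
      openHoverCard: (m: CitationMarker) => void,
      openSourceViewer: (m: CitationMarker) => void,
    ): void {
      el.addEventListener("mouseenter", () => openHoverCard(marker));
      el.addEventListener("click", () => openSourceViewer(marker));
    }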

Two design rules.

First, the marker is rendered at sentence boundary, not paragraph boundary. A paragraph with three cited sentences shows three markers, not one. This is more visual noise than a single marker per paragraph. We accept the noise because the alternative — paragraph-level citations — implies that the whole paragraph came from one source, which is sometimes true and sometimes a useful lie. We’d rather have visible per-sentence honesty than tidy paragraph-level deception.

Second, the marker numbering is per-section, not per-document. A 30-page proposal would otherwise accumulate citation numbers into the 200s, which is unreadable. Per-section numbering keeps each section’s markers in the 1–30 range, matching how reviewers think about a proposal’s sections.
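A minimal sketch of that numbering pass, assuming a flat list of draft sentences with a per-sentence cited flag (names illustrative):

    interface DraftSentence {
      sectionId: string;
      sentenceId: string;
      cited: boolean;   // true only if the sentence passed the verification gate
    }

    // Marker numbers restart at 1 inside each section instead of counting up
    // across the whole document.
    function assignMarkerNumbers(sentences: DraftSentence[]): Map<string, number> {
      const numberBySentence = new Map<string, number>();
      const countBySection = new Map<string, number>();
      for (const s of sentences) {
        if (!s.cited) continue;
        const next = (countBySection.get(s.sectionId) ?? 0) + 1;
        countBySection.set(s.sectionId, next);
        numberBySentence.set(s.sentenceId, next);
      }
      return numberBySentence;
    }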

The dashed-underline treatment for uncited content is a deliberate signal. In our drafting UI, sentences that did not pass the verification gate — sentences a human writer added directly, sentences the model produced as connective tissue between cited claims — render with a dashed underline. The reviewer at gold team can filter the document to show only uncited content. That filter is the gold-team rubric question “does any sentence in the response lack a source a reviewer can verify?” answered automatically.
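The filter itself is not exotic. A sketch, with the cited flag standing in for whatever the real verification-state field is called:

    // The gold-team "show only uncited content" filter: anything that did not
    // pass the verification gate survives the filter.
    function uncitedOnly<T extends { cited: boolean }>(sentences: T[]): T[] {
      return sentences.filter((s) => !s.cited);
    }

    // The rubric question "does any sentence lack a verifiable source?" is then
    // just a non-empty check over the filtered list.
    function hasUnsourcedContent<T extends { cited: boolean }>(sentences: T[]): boolean {
      return uncitedOnly(sentences).length > 0;
    }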

Component 2 — the hover card

The hover card is the first level of inline verification. A reviewer hovers over a citation marker and a small popup appears, anchored to the marker, showing:

  • The source document name and page reference.
  • The exact span from the source block that entailed the sentence — highlighted within a small text excerpt of the surrounding source content.
  • The block’s last_verified_at timestamp.
  • A confidence indicator (the entailment verifier’s confidence on the sentence-block pair).
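Put together, the payload behind the hover card is roughly this shape (field names are illustrative, not our production schema):

    interface HoverCardData {
      sourceDocumentName: string;
      pageReference: string;                   // e.g. "p. 14" in the source document
      citedSpan: string;                       // the exact span that entailed the sentence
      surroundingExcerpt: string;              // excerpt of the source block; citedSpan is highlighted within it
      lastVerifiedAt: string;                  // the block's last_verified_at timestamp, ISO 8601
      confidence: "green" | "amber" | "red";   // the entailment verifier's bucketed confidence
    }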

The hover card is the cheapest level of verification. It does not require the reviewer to leave the draft view. It does not require a click. It surfaces enough information that 80% of reviewer verifications happen at this level — the reviewer hovers, sees the cited span, agrees with the citation, and moves on.

Latency is critical here. The hover card has to render in under 150ms or reviewers stop using it. We pre-fetch the relevant source spans for every citation in the visible viewport when the draft loads. The hover card itself is a synchronous render against pre-fetched data; the network call happens in the background as the reviewer scrolls.
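A sketch of that prefetch pattern, assuming an IntersectionObserver over the marker elements and a fetchHoverCardData call that stands in for the real endpoint (reusing the HoverCardData shape sketched above):

    // Cache keyed by citation ID; the hover card renders synchronously from it.
    declare function fetchHoverCardData(citationId: string): Promise<HoverCardData>;

    const hoverCardCache = new Map<string, HoverCardData>();

    function prefetchVisibleCitations(markerElements: Iterable<HTMLElement>): void {
      const observer = new IntersectionObserver((entries) => {
        for (const entry of entries) {
          if (!entry.isIntersecting) continue;
          const citationId = (entry.target as HTMLElement).dataset.citationId;
          if (!citationId || hoverCardCache.has(citationId)) continue;
          // Background fetch as the reviewer scrolls; no await on the hover path.
          fetchHoverCardData(citationId).then((data) => hoverCardCache.set(citationId, data));
        }
      });
      for (const el of markerElements) observer.observe(el);
    }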

The confidence indicator deserves a note. It is a small colored dot — green, amber, red. Green means the entailment verifier passed at high confidence. Amber means it passed but at a confidence near the threshold, and a human review would be valuable. Red means the citation was overridden by an operator (which is logged — see component 4 — and therefore visible). Most cited sentences are green. Amber sentences are flagged for the gold-team review rubric. Red sentences are rare and require explicit acknowledgment from a reviewer to be left in.
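The mapping from verifier output to indicator is deliberately coarse. A sketch, where the threshold and margin are configuration values, not numbers we publish:

    type Indicator = "green" | "amber" | "red";

    // Only sentences that passed the verifier (or were explicitly overridden)
    // reach the renderer, so the mapping reduces to three cases.
    function confidenceIndicator(
      verifierConfidence: number,
      passThreshold: number,
      nearThresholdMargin: number,
      overridden: boolean,
    ): Indicator {
      if (overridden) return "red";   // logged override, needs explicit sign-off
      if (verifierConfidence >= passThreshold + nearThresholdMargin) return "green";
      return "amber";                 // passed, but close to the threshold
    }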

Component 3 — the source viewer

Clicking the citation marker opens the source viewer. This is the deeper level of verification. Where the hover card shows a span, the source viewer shows the block in context: the full block, the blocks above and below it in the source document, the document’s heading hierarchy, and a “view in source PDF” button that opens the original document at the cited page.

The source viewer is what makes the citation auditable end to end. A reviewer who is not satisfied with the hover card can open the source viewer and read the surrounding content. A reviewer who is still not satisfied can open the original PDF at the cited page. At every level, the citation chain is testable — and a reviewer who finds a citation that doesn’t hold up has all the context they need to flag it.

The source viewer also exposes the block’s version. When a source document is re-uploaded or re-parsed, blocks are versioned. A citation that was correct against version 3 of a document may not be correct against version 5 — the cited paragraph may have been edited, or removed, or moved. The source viewer shows which version was cited and whether the current version still contains the cited content. If the current version differs, a banner appears: “This block has been edited since the citation was made. View original version.”

This “View original version” affordance is something we built after our second release. The first version of the source viewer always showed the latest version of the block, and a customer flagged that this could silently change a citation under a reviewer’s feet between drafts. The fix: citations bind to a block-version pair, not just a block ID, and the viewer shows whichever version was bound at draft time, with a notice if a newer version exists.
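A sketch of that binding, with illustrative names: the citation stores the version it was verified against, and the viewer compares it to the block's current version to decide whether the banner appears.

    interface BoundCitation {
      citationId: string;
      blockId: string;
      blockVersion: number;   // the version the entailment check ran against at draft time
    }

    // True when the source document has been re-uploaded or re-parsed past the
    // bound version, i.e. when the "edited since the citation was made" banner
    // (and the "View original version" link) should appear.
    function blockEditedSinceCitation(citation: BoundCitation, currentBlockVersion: number): boolean {
      return currentBlockVersion !== citation.blockVersion;
    }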

Component 4 — the audit log

The audit log is the least visible component and the most important for trust-over-time.

Every citation event is logged. When a sentence is drafted with a citation, the log records the question, the retrieval candidates, the chosen block, the verifier’s confidence, and the timestamp. When a reviewer hovers a citation, the log records it. When a reviewer overrides a citation (accepts a sentence the verifier flagged as low-confidence, or replaces a cited block with a different one), the log records the override, the reviewer’s identity, and the rationale (which the override flow requires the reviewer to provide).
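As a rough shape (a sketch, not our wire format), the log is a discriminated union of citation events:

    type AuditEvent =
      | {
          kind: "citation_drafted";
          question: string;
          retrievalCandidateIds: string[];
          chosenBlockId: string;
          verifierConfidence: number;
          at: string;                 // ISO 8601 timestamp
        }
      | { kind: "citation_hovered"; citationId: string; reviewerId: string; at: string }
      | {
          kind: "citation_overridden";
          citationId: string;
          reviewerId: string;
          rationale: string;          // the override flow requires a rationale
          at: string;
        };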

The log is queryable from the proposal’s audit tab. A compliance reviewer at gold team can filter the proposal to “all citations overridden in this draft” and review each override individually. A customer’s internal audit team can pull the log at submission time and store it alongside the proposal record.
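That override filter is then a straightforward narrowing over the AuditEvent union sketched above (again, illustrative names):

    function overriddenCitations(events: AuditEvent[]) {
      return events.filter(
        (e): e is Extract<AuditEvent, { kind: "citation_overridden" }> =>
          e.kind === "citation_overridden",
      );
    }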

The audit log is also where the Pledge becomes contractually testable. The Pledge says we never ship a claim labeled as grounded when the claim isn’t. The log is the artifact that proves it — every citation in a shipped proposal traces to a verified retrieval event with a logged confidence score, a logged override (if any), and a logged reviewer approval.

What the rendering doesn’t do

Two things we deliberately do not render.

We do not render an attribution heatmap. We discussed this in the pillar piece: heatmaps are pretty and not causal. A heatmap that shades a sentence in colors corresponding to “how much this sentence drew from each retrieved chunk” is a similarity score, not a guarantee of grounding. Heatmaps confuse reviewers into thinking they have verified a citation when they have only verified a similarity. We replaced heatmaps with the citation marker + hover card + source viewer chain, which actually verifies.

We do not render confidence scores as numeric percentages. The amber/green/red indicator in the hover card is intentional: numeric confidence (“0.78 entailment confidence”) creates a false sense of precision and invites reviewers to compare scores across sentences when those scores aren’t actually comparable. The three-bucket indicator is honest: green is “verifier passed at high confidence, you’re likely fine,” amber is “verifier passed marginally, take a closer look,” red is “this was overridden, it needs your explicit sign-off.” That is what the reviewer needs. The underlying numeric score is in the audit log for the people who want it.

What’s still rough

The source viewer doesn’t render diagrams as well as it renders text. A citation to a diagram block — see the diagram extraction post — opens the source viewer, shows the diagram description, and renders the D2 diagram below it. But the cross-link to the original PDF page is less precise for diagrams than for text: the page reference is correct, but there is no highlighted span, because we don’t have a span for an image. We highlight the diagram block as a whole instead.

Mobile is rough. The hover card pattern doesn’t translate to touch interfaces. We render markers as tap-to-expand on mobile, which is functional but slower than desktop hover. A meaningful share of gold-team reviews happens on tablets and phones — proposal managers reviewing on the go — and we have not yet shipped the mobile parity work that would make the experience as fluid on touch as it is on desktop. It’s on the roadmap; the honest answer on timing is that we haven’t published a timeline.

Why rendering matters

A reviewer’s trust in a grounded-AI system is built or broken in the rendering layer. The Stanford HAI study on commercial legal RAG documented that practitioners reviewing AI-generated legal answers caught hallucinations at meaningfully different rates depending on how the citations were rendered. A citation rendered as a tooltip with an inline excerpt got verified more often than a citation rendered as a footnote-style link at the bottom of the page. The rate of caught hallucinations tracks the friction of verification: the lower the friction, the more reviewers actually check.

Our rendering stack is built around this finding. Hover card for the cheap check. Source viewer for the deeper check. Audit log for the durable record. Confidence indicator to triage where reviewer attention is most needed. Each component reduces the friction of verification at one level, so reviewers actually verify, instead of rubber-stamping.

Part 4 of this series, landing in the next two weeks, covers refusal flows — what happens when the verifier rejects a sentence and how the UI helps the operator decide what to do next.

Sources

  1. PursuitAgent — Grounded-AI Pledge
  2. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)