When two citations disagree: how the draft resolves it
Two KB chunks say different things about the same claim. The conflict-resolution logic that decides which one the drafted answer cites — when to prefer newer, when to prefer higher-authority, and when to refuse.
A proposal KB is not internally consistent. It cannot be. The KB is the accretion of every past proposal, every past DDQ answer, every product-marketing paragraph, every security-questionnaire response the company has ever written. Different authors wrote different paragraphs at different times with different levels of authority, and some of them disagree with each other.
When the retrieval step pulls two chunks that disagree on the same claim, the drafting pipeline has to decide what to do. This is how the decision works.
Detecting the disagreement
Disagreement detection runs on the top-K retrieved chunks before the drafting step. The check is a targeted cross-encoder pass that asks, for each pair of chunks above a threshold relevance score, whether the chunks make compatible claims about the question being answered. “Compatible” is defined narrowly — the check looks at factual assertions (numbers, dates, named features, named customers) and flags when two chunks assert different values for the same attribute.
The check is deliberately narrow. A check that flagged every stylistic difference between two chunks would generate too many false positives and slow the pipeline down. The narrow check has a higher false-negative rate — it misses some real disagreements that present as subtle rephrasing — but in practice the false negatives are recoverable (the drafter ends up citing one source and not the other) while the false positives would break the experience.
When no disagreement is flagged, the drafter proceeds with the top-K chunks as usual. When a disagreement is flagged, the resolver runs.
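The detection pass can be sketched as follows. This is a minimal illustration, not the production check: the cross-encoder is replaced here by a direct comparison over factual claims assumed to be extracted upstream, and the names (`Chunk`, `detect_conflicts`, the 0.5 relevance threshold) are assumptions for the sketch.

```python
from dataclasses import dataclass, field
from itertools import combinations

RELEVANCE_THRESHOLD = 0.5  # illustrative cutoff; only pairs above it are checked

@dataclass
class Chunk:
    chunk_id: str
    text: str
    relevance: float
    # Factual assertions extracted upstream: attribute -> asserted value.
    claims: dict = field(default_factory=dict)

def detect_conflicts(chunks):
    """Flag pairs of sufficiently relevant chunks that assert different
    values for the same attribute (number, date, named feature, ...)."""
    candidates = [c for c in chunks if c.relevance >= RELEVANCE_THRESHOLD]
    conflicts = []
    for a, b in combinations(candidates, 2):
        for attr in a.claims.keys() & b.claims.keys():
            if a.claims[attr] != b.claims[attr]:
                conflicts.append((a.chunk_id, b.chunk_id, attr))
    return conflicts
```

Note that the real check compares claims with a cross-encoder rather than by exact value equality; the structure (pairwise, thresholded, attribute-level) is the point of the sketch.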
The resolver — three signals
The resolver picks which chunk to cite when two chunks disagree. It uses three signals, in order.
Signal one: recency. Each chunk carries a timestamp from the KB content record: when the underlying document was last updated. The resolver prefers the newer chunk by default, with a soft threshold — chunks within 90 days of each other are treated as contemporaneous for this purpose and recency is not a tiebreaker.
Recency is a good default because most disagreements we see in customer KBs are stale-content issues. A 2024 chunk says the product supports SOC 2 Type I; a 2025 chunk says it supports SOC 2 Type II. The correct answer is the newer one. Recency handles this case without needing to reason about the semantics.
Signal two: authority. Each chunk is tagged with an authority level based on its source document. Proposal templates and security-questionnaire answer libraries that have been reviewed and approved by the customer’s compliance team are high-authority. Historical proposal responses from a specific past bid are medium-authority. Draft chunks, meeting notes, and SME emails captured during a capture cycle are low-authority.
When two chunks are contemporaneous (within the recency threshold) or when the user has explicitly asked for an authority-first resolution, the resolver prefers the higher-authority chunk. Authority is the right signal when the disagreement is not about staleness — it is about which source speaks for the company.
Signal three: unresolved — refuse. When recency and authority do not resolve the disagreement (contemporaneous chunks at the same authority level saying different things), the resolver does not pick. It refuses the question and surfaces both chunks to the user, labeled, with a short explanation of the conflict.
The refusal is a feature. We considered — and rejected — a fourth signal that would have used the model to pick between contemporaneous, same-authority chunks by reasoning about the semantics. The reason we rejected it is the same reason the Grounded-AI Pledge exists: when the KB genuinely disagrees with itself, the user needs to know, and the user needs to decide. A silent pick is the failure mode we are specifically trying to prevent.
What the user sees
The refusal surface for a citation conflict tells the user three things: that a conflict was detected, which claim disagrees with which, and what the user can do next. The options on the surface are to pick a source (the user explicitly chooses which chunk to cite), to correct the KB (fix whichever chunk is wrong and re-run), or to escalate to an SME (flag the conflict for the subject-matter expert who owns the content).
In practice most conflicts get resolved by the user picking a source in the moment and then flagging the KB for follow-up. The follow-up is important — a conflict that is not cleaned up will fire again on the next question that retrieves the same chunks — but it is cleanup, not blocking work.
Metrics
Two metrics on the dashboard track this behavior.
Conflict-detection rate. The fraction of drafting requests where the cross-encoder check flags at least one conflict in the top-K chunks. A healthy range for a maturing KB is about 2% to 6%. A rate above 10% usually indicates a KB that has not been pruned in a long time. A rate near zero usually indicates a KB that is narrow enough that disagreements cannot surface — often a sign that the KB scope is too small, not that the content is consistent.
Refusal-on-conflict rate. The fraction of detected conflicts that end up in a refusal rather than being resolved automatically by recency or authority. A healthy range is 15% to 25%. A much higher rate means the KB is underlabeled (authority tags and timestamps are missing); a much lower rate means the automatic resolution is more aggressive than it should be and the pledge is leaking.
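Both metrics are simple ratios over logged events. A minimal sketch, assuming per-request and per-conflict log records with the hypothetical fields shown:

```python
def conflict_detection_rate(requests):
    """Fraction of drafting requests where the check flagged at least
    one conflict in the top-K chunks. Healthy: roughly 0.02 to 0.06."""
    flagged = sum(1 for r in requests if r["conflicts"] > 0)
    return flagged / len(requests)

def refusal_on_conflict_rate(conflicts):
    """Fraction of detected conflicts that ended in a refusal rather
    than automatic resolution. Healthy: roughly 0.15 to 0.25."""
    refused = sum(1 for c in conflicts if c["outcome"] == "refused")
    return refused / len(conflicts)
```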
What this costs
The cross-encoder check adds about 40 to 60 milliseconds to the drafting path per request, depending on K and on how many pairs clear the relevance threshold. We considered running the check in the background with a post-hoc flag, decided against it, and kept it on the synchronous path. The added latency is worth paying to avoid showing a citation that the system knew was in conflict before it drafted.
What is still open: detecting disagreement across more than two chunks. The current implementation is pairwise, which is correct for most cases but misses the occasional three-way conflict where chunks A and B agree and chunk C disagrees with both. We have a sketch for a transitive-closure variant but have not shipped it yet. The frequency of three-way conflicts is low enough that the cost of shipping it has not cleared the bar.
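For illustration only — this is not the shipped variant, just one way the transitive-closure sketch could group pairwise flags into conflict clusters so that a three-way conflict surfaces as a single group rather than two unrelated pairs:

```python
def conflict_clusters(pairs):
    """Group pairwise conflict flags into clusters via connected
    components (union-find), so chunks A and B that each conflict
    with chunk C surface together as one three-way cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in pairs:
        union(a, b)
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())
```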