In preview: per-answer quality score with breakdown
A four-dimensional quality score — clarity, grounding, compliance, brevity — is rolling out in preview on drafted answers. How the score is computed, where it lives in the UI, and what it changes about the review pass.
Rolling out in preview today: every drafted answer can now carry a four-part quality score. One number from 0 to 100, broken out into four sub-scores: clarity, grounding, compliance, brevity. The score is opt-in per workspace while we tune it; the Proposal Builder page does not list it as a default capability yet.
The score lives in the answer header, next to the draft’s citation count. Clicking it opens a panel with the breakdown.
What each sub-score measures
Grounding (weighted 40%). The share of substantive claims in the answer that resolved to a verified citation during the claim-verification pass. A grounding score below 80 means at least one claim was written but not verified against a source block. The panel lists the unverified claims by sentence.
Compliance (weighted 30%). For RFP answers, whether the answer addresses every “shall” or “must” statement in the associated compliance-matrix row. For DDQ answers, whether the answer is within the buyer’s format requirements (word count, structured-answer schema, attachment references). Compliance below 90 means at least one applicable requirement was not demonstrably addressed.
Clarity (weighted 15%). A readability measure calibrated against the APMP evaluator-reading standard: sentence length distribution, passive voice rate, and proportion of category jargon relative to plain-language anchors. Clarity above 80 is an answer an evaluator can read once and understand.
Brevity (weighted 15%). Word count against a target. The target varies by answer type: 40–80 words for a simple DDQ yes/no with justification, 120–250 for a standard RFP subsection, and up to 500 for an executive-summary paragraph. A brevity score below 70 means the answer is more than 1.3x the target word count.
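Taken together, the composite is a plain weighted sum of the four sub-scores. Here is a minimal sketch, assuming each sub-score is already on a 0–100 scale; the function names and the linear brevity falloff are ours for illustration, not the shipped formula:

```python
# A minimal sketch of the composite. The 40/30/15/15 weights are the
# published ones; everything else here (names, the brevity falloff) is assumed.

WEIGHTS = {"grounding": 0.40, "compliance": 0.30, "clarity": 0.15, "brevity": 0.15}

def grounding_score(verified_claims: int, total_claims: int) -> float:
    """Share of substantive claims that resolved to a verified citation."""
    if total_claims == 0:
        return 100.0  # nothing substantive to verify
    return 100.0 * verified_claims / total_claims

def brevity_score(word_count: int, target: int) -> float:
    """Hypothetical mapping: full marks at or under target, then a linear
    falloff tuned so exactly 1.3x the target scores 70 and anything longer
    drops below 70, matching the published threshold."""
    if word_count <= target:
        return 100.0
    overshoot = word_count / target - 1.0  # 0.3 at 1.3x the target
    return max(0.0, 100.0 - 100.0 * overshoot)

def composite(sub_scores: dict[str, float]) -> int:
    """Weighted sum of the four sub-scores, reported as one number 0-100."""
    return round(sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS))
```

So an answer with grounding 80, compliance 95, clarity 85, and brevity 70 lands at round(32 + 28.5 + 12.75 + 10.5) = 84.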
How the score gets used
Three flows changed the moment this shipped (a sketch of all three follows the list):
Reviewer triage. The review queue now sorts by composite score ascending. The lowest-scoring answers surface first. A reviewer with 45 minutes clears the bottom quartile before touching anything else.
Filter-by-score in the color-team review. Red-team reviewers can filter the document view to “answers with grounding < 90” or “answers with compliance < 95.” No more reading the whole draft looking for the weak spots.
Export gating (optional). An administrator can require a minimum composite score before an answer can be included in an export. Default: off. Turned on by three tenants within the first six hours.
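All three flows reduce to a sort, a filter, and a threshold check over the same breakdown. A sketch under assumed names; the Answer shape and the None-means-off gating convention are invented for illustration, not the real API:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    # Field names are illustrative, not the real record shape.
    id: str
    composite: float
    grounding: float
    compliance: float

def triage_order(answers: list[Answer]) -> list[Answer]:
    """Reviewer triage: lowest composite score first."""
    return sorted(answers, key=lambda a: a.composite)

def weak_spots(answers: list[Answer]) -> list[Answer]:
    """Color-team filter: grounding < 90 or compliance < 95."""
    return [a for a in answers if a.grounding < 90 or a.compliance < 95]

def exportable(answer: Answer, min_score: float | None) -> bool:
    """Export gating: min_score is None when the admin leaves it off (the default)."""
    return min_score is None or answer.composite >= min_score
```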
Where the score is wrong
The four sub-scores do not add up to “this is a good answer.” An answer can score 96 and still miss the win theme. An answer can score 68 because it’s deliberately terse (“Yes. See Section 3.2.”) and be correct. We know.
The score is a triage tool, not a verdict. The panel explicitly says so. We’ve watched teams use a previous confidence score as a rubber stamp; this one is designed to surface issues, not certify quality. A human reviewer is still the verdict.
What we’d still change
The brevity weighting is our best guess. We’ll publish an update in 60 days with the actual distribution and rebalance if the data demands it. The composite formula is the conservative version; if three weeks of real usage shows a particular sub-score is uninformative, we’ll drop it. Consider this v1, shipped, and instrumented.