Compliance extraction, revisited
The grammar we moved to for requirements extraction, why we stopped treating 'shall' as a single class, and the evaluation showing a 38% drop in false-positive requirements.
Our compliance extractor turns a 120-page RFP into a compliance matrix in about four minutes. For most of the first year of the product, it worked on a simple grammar: find every sentence containing a compliance verb (shall, must, will provide, is required to), classify the subject, and emit a row.
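The v1 grammar can be sketched in a few lines. This is an illustrative reconstruction, not the production code: the verb list is the four phrases named above, and the row shape is assumed.

```python
import re

# Sketch of the v1 grammar: any sentence containing a compliance verb
# becomes a matrix row. Verb list matches the post; row fields are
# illustrative.
COMPLIANCE_VERBS = re.compile(
    r"\b(shall|must|will provide|is required to)\b", re.IGNORECASE
)

def extract_rows_v1(sentences):
    """Emit one matrix row per sentence containing a compliance verb."""
    return [
        {"text": s, "verb": m.group(1).lower()}
        for s in sentences
        if (m := COMPLIANCE_VERBS.search(s))
    ]

rows = extract_rows_v1([
    "The offeror shall provide a technical approach.",
    "The kickoff meeting is scheduled for June.",
])
# only the first sentence produces a row
```

Note the sketch has no notion of who the subject is, which is exactly the weakness the next section describes.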
That grammar had a false-positive rate of ~22%. Every fifth row in the matrix was wrong — not a real requirement, or a requirement the vendor didn’t have to answer. Reviewers spent real minutes culling.
This quarter we moved to a new grammar. The false-positive rate dropped to ~14%. This post walks through the change.
Why “shall” alone isn’t enough
Federal RFPs use shall in three operationally distinct ways:
- Vendor obligations. “The offeror shall provide a technical approach addressing…” Real requirement. Needs a matrix row.
- Buyer reservations. “The Government shall have the right to modify…” Not a vendor requirement. No matrix row.
- General contractual language. “The contract shall be governed by…” Contractual assertion, not a vendor response requirement.
The v1 grammar caught all three as requirements. The reviewer then deleted the buyer-reservation and contractual rows by hand. The false positive rate was driven almost entirely by this conflation.
The new grammar
The v2 grammar classifies shall sentences by subject and mood before emitting a row. Three passes:
Pass 1 — subject classification. A small classifier (fine-tuned ModernBERT, run locally, no LLM call) labels the subject of each shall clause as vendor, buyer, contract, third-party, or unknown. Only vendor emits a matrix row by default. buyer and contract get annotated but suppressed. third-party (e.g., “the integrator shall…”) gets a conditional row flagged for human review.
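The routing logic downstream of the classifier can be sketched like this. The classifier itself is stubbed out; the label set and the emit/suppress/flag rules follow the text, but the `Row` shape and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Row:
    text: str
    subject: str
    emit: bool          # appears in the default matrix view
    needs_review: bool  # flagged for a human

def route(sentence: str, subject: str) -> Row:
    """Apply the pass-1 routing rules to a classified shall clause."""
    if subject == "vendor":
        return Row(sentence, subject, emit=True, needs_review=False)
    if subject in ("buyer", "contract"):
        # annotated but suppressed from the default matrix
        return Row(sentence, subject, emit=False, needs_review=False)
    if subject == "third-party":
        # conditional row, flagged for human review
        return Row(sentence, subject, emit=True, needs_review=True)
    # anything else is treated as unknown: shown, but with an ambiguity flag
    return Row(sentence, "unknown", emit=True, needs_review=True)
```

The asymmetry is deliberate: suppressing a buyer row costs nothing, but silently dropping an ambiguous row would trade a false positive for a false negative.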
Pass 2 — compliance-verb expansion. We expanded the compliance-verb set from 4 phrases to 17, based on an analysis of 2,800 federal RFPs across 2024–2025. The additions are mostly forms of provide, submit, address, describe, demonstrate, and comply with — each with modal and tense variants. Phrases like “offerors are expected to include” now extract as requirements; before, they were silent.
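A few of the expanded patterns, sketched as regexes. This is an illustrative subset, not the production list of 17 phrases, and the exact variants are assumptions:

```python
import re

# Illustrative subset of the v2 verb set; the production grammar has 17
# phrases with modal and tense variants.
V2_PATTERNS = [
    r"\bshall (?:provide|submit|address|describe|demonstrate|comply with)\b",
    r"\bmust (?:provide|submit|address|describe|demonstrate)\b",
    r"\bis required to\b",
    r"\bofferors? (?:are|is) expected to include\b",
]
V2_VERBS = re.compile("|".join(V2_PATTERNS), re.IGNORECASE)

def has_compliance_verb(sentence: str) -> bool:
    return V2_VERBS.search(sentence) is not None
```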
Pass 3 — scope extraction. Each requirement row now has a scope field: technical, management, past-performance, pricing, administrative, or other. The scope is inferred from surrounding section context plus the requirement text. Scope lets us sort and filter the matrix for the color-team review without reading every row.
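A minimal sketch of the scope inference, assuming section context wins over requirement text when both match. The hint tables and precedence are hypothetical; the real pass uses richer context.

```python
# Scope labels from the post; hint keywords are illustrative.
SECTION_HINTS = {
    "technical approach": "technical",
    "management plan": "management",
    "past performance": "past-performance",
    "price": "pricing",
}

def infer_scope(section_title: str, requirement_text: str) -> str:
    """Infer a scope label from section context, falling back to the text."""
    title = section_title.lower()
    for hint, scope in SECTION_HINTS.items():
        if hint in title:
            return scope
    text = requirement_text.lower()
    if "price" in text or "cost" in text:
        return "pricing"
    return "other"
```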
Evaluation
We evaluated on a held-out set of 140 RFPs with human-annotated compliance matrices. Numbers:
| Metric | v1 grammar | v2 grammar | Delta |
|---|---|---|---|
| Recall (vendor requirements caught) | 91% | 94% | +3 pts |
| Precision (extracted rows that are real) | 78% | 86% | +8 pts |
| False positive rate | 22% | 14% | -38% relative |
| False negative rate | 9% | 6% | -33% relative |
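The relative deltas follow directly from the rates. Computed from the rounded rates in the table, the false-positive drop comes out near 36% rather than exactly 38%, so the published figure presumably comes from the unrounded rates:

```python
def relative_drop(old: float, new: float) -> float:
    """Relative reduction from old rate to new rate."""
    return (old - new) / old

fp_drop = relative_drop(0.22, 0.14)  # ≈ 0.364, from rounded rates
fn_drop = relative_drop(0.09, 0.06)  # ≈ 0.333
```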
Both axes improved, but the bigger gain was in precision. Reviewers spend less time deleting bad rows than they did hunting for missed rows, so the precision improvement is the more valuable one.
A 14% false positive rate is still not zero. It’s never going to be zero; there is irreducible ambiguity in a document written by a contracting officer at 4 PM on a Friday. The target for v3 is ~10%, and we know where the remaining errors cluster.
Where the remaining errors live
Three clusters in the 14%:
Conditional requirements. “If the offeror proposes a cloud-hosted solution, the offeror shall…” The grammar extracts the consequent as a requirement even when the antecedent doesn’t apply. We’re working on a v3 pass that preserves the conditional structure and emits the row with an applicable_if annotation.
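The planned v3 pass could look something like the following sketch. The regex, field names, and `applicable_if` annotation shape are hypothetical; real RFP conditionals are messier than a leading "If …," clause.

```python
import re

# Hypothetical sketch of the planned v3 conditional pass: split an
# "If X, ... shall Y" sentence into an antecedent and a requirement row
# carrying an applicable_if annotation.
CONDITIONAL = re.compile(
    r"^\s*If (?P<condition>[^,]+),\s*(?P<consequent>.+)$", re.IGNORECASE
)

def extract_conditional(sentence: str) -> dict:
    m = CONDITIONAL.match(sentence)
    if not m:
        return {"text": sentence, "applicable_if": None}
    return {
        "text": m.group("consequent"),
        "applicable_if": m.group("condition").strip(),
    }

row = extract_conditional(
    "If the offeror proposes a cloud-hosted solution, "
    "the offeror shall describe the hosting environment."
)
# row["applicable_if"] == "the offeror proposes a cloud-hosted solution"
```

The point of the annotation is that the row survives into the matrix but the reviewer can filter it out when the antecedent doesn't apply, rather than the extractor guessing.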
Cross-referenced requirements. “The offeror shall comply with the standards specified in Section L.4.” Section L.4 is a sub-document. The grammar extracts the surface requirement but the real requirements are in the referenced section. We need to resolve the cross-reference and inline the actual standards — on our roadmap, not in v2.
Ambiguous subject. About 3% of shall sentences have a subject that’s genuinely unclear: passive constructions where the subject is elided, or compound subjects where one subject is the vendor and one isn’t. The v2 classifier marks these unknown and flags them for human review, which is honest but also what drives the residual precision gap.
What changed in the product
The compliance matrix UI now color-codes rows by subject class. Vendor rows are the default row color; buyer/contract rows are demoted to a secondary panel; unknown rows appear in the primary view with an ambiguity flag.
The matrix view also gained a scope filter. A reviewer can look at just pricing requirements, or just past-performance requirements, which matches how the draft actually gets assembled.
The operational lift
Across the first two weeks of v2 production, customer-reported corrections to the compliance matrix dropped about 40%. The rework time on compliance matrices dropped proportionally. We haven’t tried to publish a time-saved number because it depends heavily on RFP length and how carefully the customer reviews, but the internal signal is clear: v2 holds up in production at roughly the evaluation rate.
The takeaway
Requirements extraction isn’t a single grammar. It’s three passes — subject, verb, scope — and the subject pass is where the biggest gains lived. Treating all shall sentences identically was a defensible v1 choice; treating them differently is a necessary v2 choice.