Field notes

Security questionnaires: linking answers to evidence

How a SOC 2 attestation PDF becomes a citation source for DDQ answers. The ingest pipeline, the per-control extraction, and the per-claim linking that makes 'yes' answers verifiable instead of theatrical.

The PursuitAgent engineering team · 5 min read · Procurement

A “yes” on a DDQ is theater unless the answer points to evidence. “Yes, we encrypt data at rest” is a claim. “Yes, we encrypt data at rest, AES-256, attested in our SOC 2 Type 2 report Section CC6.7, dated 2025-04-15” is a verifiable answer. The first is what most vendors ship. The second is what buyers actually want.

This post is the engineering note on how a SOC 2 attestation PDF (or an ISO 27001 certificate, or a HITRUST report, or a CMMC assessment) becomes a citation source the drafter can attach to DDQ answers. The pipeline has three stages: ingest, per-control extraction, and per-claim linking.

Stage 1 — ingest

Attestation PDFs are not normal PDFs. SOC 2 reports are typically 60 to 120 pages. They have a standard structure (Trust Services Criteria, control mappings, auditor’s opinion, management assertion, control test results) but they vary in formatting. Some are professionally typeset; some are exported from compliance tools and have layout artifacts.

Our ingest uses LlamaParse for the heavy text extraction (the same path described in the LlamaParse ingest pipeline post) plus a SOC-2-specific structuring step. The structuring step is a small LLM pass that maps the parsed text to a known schema:

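// Top-level record produced by the structuring pass, one per attestation PDF.
type Soc2Report = {
  reportType: "SOC2_Type_I" | "SOC2_Type_II";
  auditPeriod: { start: Date; end: Date };
  auditor: string;
  managementAssertion: string;
  trustCriteria: TrustCriterion[]; // TrustCriterion type not shown in this post
  controls: ControlEvidence[];
};

// One tested control, as reported in the auditor's test results.
type ControlEvidence = {
  id: string; // e.g., "CC6.7"
  description: string;
  testProcedure: string;
  testResult: "no_exceptions" | "exceptions_noted";
  exceptions?: string; // present only when exceptions were noted
  pageReference: number; // page of the source PDF
};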

The schema is conservative — we extract what is consistently present across vendors’ reports and skip the parts that are formatted differently across reports. The output is a structured record per attestation that the rest of the pipeline can query.
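Because the structuring step is an LLM pass, its output is worth checking before anything downstream trusts it. The following is a minimal sketch of such a check, not necessarily how our pipeline does it: `parseControlEvidence` and `ParseResult` are hypothetical names, and the rule that an `exceptions_noted` result must carry exception text is an assumption layered on the schema above.

```typescript
// Field names follow the ControlEvidence type in this post; the validator
// itself (names, rules, error messages) is a hypothetical illustration.
type ControlEvidence = {
  id: string;
  description: string;
  testProcedure: string;
  testResult: "no_exceptions" | "exceptions_noted";
  exceptions?: string;
  pageReference: number;
};

type ParseResult =
  | { ok: true; control: ControlEvidence }
  | { ok: false; errors: string[] };

function parseControlEvidence(raw: unknown): ParseResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["not an object"] };
  }
  const r = raw as Record<string, unknown>;
  const errors: string[] = [];

  // Read a required non-empty string field, recording an error if it is missing.
  const str = (key: string): string => {
    const v = r[key];
    if (typeof v !== "string" || v.trim() === "") {
      errors.push(`${key}: expected non-empty string`);
      return "";
    }
    return v;
  };
  const id = str("id");
  const description = str("description");
  const testProcedure = str("testProcedure");

  const rawResult = r["testResult"];
  let testResult: ControlEvidence["testResult"] | undefined;
  if (rawResult === "no_exceptions" || rawResult === "exceptions_noted") {
    testResult = rawResult;
  } else {
    errors.push("testResult: unknown value");
  }

  // Assumption: a noted exception without its text is treated as a parse failure.
  const rawExc = r["exceptions"];
  const exceptions =
    typeof rawExc === "string" && rawExc.trim() !== "" ? rawExc : undefined;
  if (testResult === "exceptions_noted" && exceptions === undefined) {
    errors.push("exceptions: required when exceptions are noted");
  }

  const rawPage = r["pageReference"];
  let pageReference: number | undefined;
  if (typeof rawPage === "number" && Number.isInteger(rawPage) && rawPage >= 1) {
    pageReference = rawPage;
  } else {
    errors.push("pageReference: expected a positive integer");
  }

  if (errors.length > 0 || testResult === undefined || pageReference === undefined) {
    return { ok: false, errors };
  }
  const control: ControlEvidence = { id, description, testProcedure, testResult, pageReference };
  if (exceptions !== undefined) control.exceptions = exceptions;
  return { ok: true, control };
}
```

Returning every error rather than stopping at the first makes it easier to see whether a report parsed badly in one place or everywhere.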

Stage 2 — per-control extraction

Each control in the attestation becomes a KB block. The block schema:

  • block_type: "attestation_control".
  • framework: "SOC2" | "ISO27001" | "HITRUST" | "CMMC".
  • control_id: e.g., "CC6.7" for SOC 2, "A.10.1.1" for ISO 27001 (2013 Annex A numbering).
  • description: the control’s description verbatim.
  • test_result: passed / exception (corresponding to no_exceptions / exceptions_noted in the Stage 1 schema).
  • exception_text: present only if exception.
  • page_reference: which page of the source PDF contains this control’s results.
  • attestation_period: the audit period the control was tested over.
  • source_doc_id: pointer to the original PDF, with version.
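The block fields above can be written down as a type, together with a conversion from an extracted control. This is a sketch under assumptions rather than the shipped schema: the field names come from the list, but `toBlock`, `IngestContext`, the ISO-date string format for the period, and the `"soc2-report@v3"` style of versioned document pointer are all hypothetical.

```typescript
type Framework = "SOC2" | "ISO27001" | "HITRUST" | "CMMC";

// Field names are taken from the list above; value shapes are assumptions.
type AttestationControlBlock = {
  block_type: "attestation_control";
  framework: Framework;
  control_id: string;
  description: string;
  test_result: "passed" | "exception";
  exception_text?: string;
  page_reference: number;
  attestation_period: { start: string; end: string }; // ISO dates (assumed)
  source_doc_id: string; // versioned pointer, format assumed
};

// The subset of the Stage 1 control record that the block needs.
type ExtractedControl = {
  id: string;
  description: string;
  testResult: "no_exceptions" | "exceptions_noted";
  exceptions?: string;
  pageReference: number;
};

// Hypothetical per-document context supplied by the ingest step.
type IngestContext = {
  framework: Framework;
  period: { start: string; end: string };
  sourceDocId: string;
};

function toBlock(control: ExtractedControl, ctx: IngestContext): AttestationControlBlock {
  const hasException = control.testResult === "exceptions_noted";
  const block: AttestationControlBlock = {
    block_type: "attestation_control",
    framework: ctx.framework,
    control_id: control.id,
    description: control.description, // kept verbatim
    test_result: hasException ? "exception" : "passed",
    page_reference: control.pageReference,
    attestation_period: ctx.period,
    source_doc_id: ctx.sourceDocId,
  };
  // exception_text is only set when an exception was actually noted.
  if (hasException && control.exceptions !== undefined) {
    block.exception_text = control.exceptions;
  }
  return block;
}
```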

The blocks are versioned per attestation period. A new SOC 2 Type 2 report (typically annual) creates a new generation of blocks; the old generation is marked superseded but retained for historical questions. Freshness is important here — we covered the shipped freshness scores feature in May, and attestation blocks sit at the strictest end of the freshness gradient. A SOC 2 control older than 14 months is flagged as stale even if it has not been replaced.
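The superseded and stale rules can be sketched as a small state function. Two assumptions are baked in and worth stating: the 14 months are measured from the end of the audit period (the post does not say which date the clock starts from), and a superseded block reports as superseded regardless of age. The names `blockState` and `addMonthsUtc` are hypothetical.

```typescript
const STALE_AFTER_MONTHS = 14;

type BlockState = "current" | "stale" | "superseded";

// Calendar-month addition in UTC. Day overflow rolls into the following
// month (for example, Jan 31 plus one month lands in early March).
function addMonthsUtc(date: Date, months: number): Date {
  const d = new Date(date.getTime());
  d.setUTCMonth(d.getUTCMonth() + months);
  return d;
}

// Assumption: age is measured from the end of the audit period.
function blockState(periodEnd: Date, superseded: boolean, now: Date): BlockState {
  if (superseded) return "superseded";
  const staleAt = addMonthsUtc(periodEnd, STALE_AFTER_MONTHS);
  return now.getTime() > staleAt.getTime() ? "stale" : "current";
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test against fixed dates.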

Stage 3 — per-claim linking

A DDQ question like “describe your encryption at rest controls” arrives. The drafter’s retrieval pulls candidate blocks. The candidates include:

  • The narrative block describing our encryption posture (if one exists in the KB).
  • The relevant SOC 2 controls (CC6.7, CC6.1, depending on which the auditor used).
  • Any ISO 27001 controls if we hold an ISO 27001 cert (A.10.1.1, A.10.1.2).
  • Any policy or procedure documents that describe the implementation.

The drafter assembles the answer with the narrative block as the prose source and the attestation controls as the per-claim citations. Each substantive claim in the answer (“we use AES-256,” “keys are rotated quarterly,” “the controls were last tested by [auditor]”) gets a citation pointer to the relevant attestation control or policy block.

The output, rendered in the proposal builder, looks like a paragraph with citation footnote markers inline. Click any marker and the verify panel opens (the same panel from the inline verify button post): the source block renders with the matched span highlighted, and the auditor, attestation period, and page reference are all visible.
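To make the claim-to-citation structure concrete, here is a minimal sketch of rendering claims with inline footnote markers. It illustrates the idea rather than the proposal builder's actual renderer: the `Claim` and `Citation` shapes, the `[unverified]` marker, and the block ID, control ID, and page number in the test data are all hypothetical.

```typescript
// Hypothetical shapes: a claim optionally points at one source block.
type Citation = { blockId: string; controlId: string; page: number };
type Claim = { text: string; citation?: Citation };

type Rendered = { body: string; footnotes: string[]; uncited: string[] };

function renderWithFootnotes(claims: Claim[]): Rendered {
  const footnotes: string[] = [];
  const uncited: string[] = [];
  const markerByBlock = new Map<string, number>();

  const parts = claims.map((claim) => {
    if (!claim.citation) {
      // Claims without evidence stay visible instead of passing silently.
      uncited.push(claim.text);
      return `${claim.text} [unverified]`;
    }
    // Reuse the marker when several claims cite the same block.
    let marker = markerByBlock.get(claim.citation.blockId);
    if (marker === undefined) {
      footnotes.push(`${claim.citation.controlId}, p. ${claim.citation.page}`);
      marker = footnotes.length;
      markerByBlock.set(claim.citation.blockId, marker);
    }
    return `${claim.text} [${marker}]`;
  });

  return { body: parts.join(" "), footnotes, uncited };
}
```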

What this changes for the buyer

Three things, in order.

The buyer sees the evidence at the same time as the answer. They do not have to ask for the SOC 2 report separately and cross-reference it manually. The citation links directly to the report, opens to the right page, and highlights the relevant control.

Stale evidence becomes visible. If our SOC 2 coverage lapsed six months ago and no new report has been ingested, every answer that cites the lapsed controls renders with a freshness warning. The buyer knows what they are reading.

Fabrication becomes harder. Safe Security’s piece on the security questionnaire surface named the failure mode: vendors recycle outdated answers, and buyers collect reassuring “yes” responses that do not reflect real posture. Per-claim linking forces a recycled answer to either re-ground in a current attestation or be flagged. Failing by default is the right behavior.
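The default-fail rule can be sketched as a gate over linked claims: an answer counts as grounded only when every claim points at current evidence. This is an illustration under assumptions, not the drafter's actual logic; `gateAnswer`, `LinkedClaim`, and the decision to flag an answer with no claims at all are hypothetical.

```typescript
type EvidenceState = "current" | "stale" | "superseded";

// Hypothetical shape: evidence is absent when the claim has no citation.
type LinkedClaim = { text: string; evidence?: EvidenceState };

type Verdict = { status: "grounded" | "flagged"; reasons: string[] };

function gateAnswer(claims: LinkedClaim[]): Verdict {
  const reasons: string[] = [];
  // Assumption: an answer with no claims has nothing grounding it, so it is flagged.
  if (claims.length === 0) reasons.push("no claims to verify");

  for (const claim of claims) {
    if (claim.evidence === undefined) {
      reasons.push(`no citation: ${claim.text}`);
    } else if (claim.evidence !== "current") {
      reasons.push(`${claim.evidence} evidence: ${claim.text}`);
    }
  }
  // Grounded only when nothing at all was flagged.
  return { status: reasons.length === 0 ? "grounded" : "flagged", reasons };
}
```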

What is still hard

Two things.

Industry-specific framing of the same evidence. The vendor onboarding DDQ teardown covered the four-industry variation; the same SOC 2 control evidence supports answers in finance, healthcare, SaaS, and defense, but the framing language and the supplementary citations differ. Our drafter currently picks the right industry framing about 85% of the time on the bids we run; the remaining 15% requires manual editing. We are working on per-industry retrieval boosts to close that gap.
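One simple form a per-industry retrieval boost could take is a multiplicative rerank over retrieved candidates. This is a sketch of the general idea only: the work is not finished, and the `Candidate` shape, the `industries` tag, and the `1.2` default multiplier are all hypothetical.

```typescript
// Hypothetical candidate shape: a retrieval score plus industry tags.
type Candidate = { id: string; score: number; industries: string[] };

// Multiply the score of candidates tagged with the buyer's industry,
// then sort by the adjusted score, highest first. Input is not mutated.
function rerankForIndustry(
  candidates: Candidate[],
  industry: string,
  boost: number = 1.2
): Candidate[] {
  return candidates
    .map((c) => ({
      ...c,
      score: c.industries.includes(industry) ? c.score * boost : c.score,
    }))
    .sort((a, b) => b.score - a.score);
}
```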

Multi-attestation reconciliation. A vendor that holds SOC 2, ISO 27001, and HITRUST has overlapping control evidence: the same underlying control appears in three frameworks under three IDs. Reconciling them so the drafter cites the right framework’s control for the right buyer is a routing problem we have not fully solved. Today the drafter cites the SOC 2 control by default, and the writer redirects manually if the buyer’s framing calls for ISO or HITRUST. This works for now but is not the right long-run answer.
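The routing problem can be framed as a lookup over a crosswalk that maps one underlying control to its ID in each framework. The sketch below keeps today's behavior (SOC 2 as the fallback) and adds a buyer preference list in front of it. The `Crosswalk` type, `pickCitation`, and the pairing of `CC6.1` with `A.10.1.1` in the test data are illustrative assumptions, not an authoritative control mapping.

```typescript
type Framework = "SOC2" | "ISO27001" | "HITRUST" | "CMMC";

// One underlying control and its ID in each framework that covers it.
type Crosswalk = Partial<Record<Framework, string>>;

type Pick = { framework: Framework; controlId: string };

// Try the buyer's preferred frameworks in order, then fall back to SOC 2.
function pickCitation(crosswalk: Crosswalk, buyerPreference: Framework[]): Pick | undefined {
  const order: Framework[] = [...buyerPreference, "SOC2"];
  for (const framework of order) {
    const controlId = crosswalk[framework];
    if (controlId !== undefined) return { framework, controlId };
  }
  // No preferred framework and no SOC 2 control: leave the choice to the writer.
  return undefined;
}
```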

For the broader picture of how citation grounding works in the drafter, the canonical post is the Grounded-AI Pledge in code. For the buyer-side perspective on why this matters, see Procurement-side pain is real from earlier this week.

Sources

  1. Safe Security — Vendor security questionnaire best practices
  2. Loopio — Best DDQ software
  3. PursuitAgent Grounded-AI Pledge