Field notes

Security questionnaires: the 80% that's really retrieval

The canonical Engineering pillar on DDQ automation. A 300-question security questionnaire is not 300 unique questions — it's mostly retrieval against a corpus that's already written, plus a small tail that isn't.

The PursuitAgent engineering team · 17 min read · Procurement

A 300-question security questionnaire is not 300 unique questions. It is mostly retrieval queries against a corpus your security team has already written, plus a smaller tail of net-new answers that require a subject-matter expert to think for five minutes. Our rule of thumb is that the split runs roughly 80/20 — that is our own operational observation from working through questionnaires with our customers, not a measured industry benchmark. The ratio varies by buyer and by framework; the shape (big retrievable majority, smaller SME-bound tail) is what generalizes.

This is the pillar on how we handle security questionnaires inside PursuitAgent. It is written from the engineering side — the retrieval layer, the verification pass, the escalation path for the tail. If you are looking for the craft essay on how to staff a security-questionnaire team, Sarah wrote that one yesterday. This post is about the machine underneath.

The 80/20 is the thing that changes how the work is staffed and tooled. Miss the split and you either over-automate (auto-answer everything, watch 30 of those answers be wrong, and lose the deal) or under-automate (send every question to an SME, burn 40 hours per questionnaire, and ship late). The right move is to automate the 240, escalate the 60, and verify every one of the 240 before a claim leaves the queue.

What the volume actually looks like

Safe Security published data in 2024 showing that a mid-market SaaS vendor with regulated customers receives 500 or more security questionnaires per year. The median questionnaire they measured had between 200 and 400 questions. A vendor at that volume is answering something like 150,000 questions per year.

Nobody answers 150,000 unique questions per year. If you open 30 questionnaires from 30 different buyers, you will find that the questions cluster around a small number of frameworks: SOC 2 (the 2017 trust services criteria, updated 2022), ISO 27001, CAIQ, SIG, HIPAA where relevant, PCI-DSS where relevant, and a handful of state-specific addenda. Within those frameworks, the questions are paraphrases of each other.

“Do you encrypt data at rest?” and “Is customer data encrypted when stored?” and “Describe your at-rest encryption posture, including algorithm and key management.” are the same question, asked three ways, answered the same way: “Yes. AES-256. Keys managed via AWS KMS with rotation every 90 days. Evidence: SOC 2 report, section CC6.1.”

Loopio has written before that the average security questionnaire takes 15 to 40 hours to complete. Their number matches ours. Their number also hides the split: of those 15 to 40 hours, something like 12 to 32 are spent on the 80% that repeats, and 3 to 8 are spent on the 20% that doesn’t. The retrieval-bound portion is the one worth automating because it is structurally repeatable. The tail is where SMEs earn their paychecks.

The 80% that repeats

Here is a worked example of 12 typical questions from a real SOC 2 Type II-aligned questionnaire, lightly reworded to strip a specific buyer’s voice. Each question maps to a KB block that was written once — usually during the customer’s last audit cycle — and has been cited dozens of times since.

| # | Question | KB block category | Owner |
|---|----------|-------------------|-------|
| 1 | Do you maintain a written information security policy? | policy/isp-overview | Security |
| 2 | Encryption at rest algorithm and key management? | control/encryption-at-rest | Security |
| 3 | Encryption in transit protocols (TLS version, ciphers)? | control/encryption-in-transit | Security |
| 4 | Describe your incident response process. | process/ir-runbook | Security |
| 5 | Business continuity and disaster recovery RPO/RTO? | policy/bcp-dr | Security |
| 6 | Vendor risk management process. | process/vendor-risk-assessment | GRC |
| 7 | Data residency options for customer data. | control/data-residency | Legal + Security |
| 8 | Access control model (RBAC, MFA, SSO)? | control/access-control | Security |
| 9 | Penetration testing cadence and last report date. | evidence/pentest-cycle | Security |
| 10 | Employee security training (frequency, content, attestation). | process/security-training | HR + Security |
| 11 | SOC 2 Type II report, date of last attestation. | evidence/soc2-attestation | GRC |
| 12 | Subprocessor list and notification process. | control/subprocessors | Legal |

Twelve questions. Twelve blocks. Each block was written once. Each block has a version pinned to the most recent attestation cycle. Each block carries a freshness score that flags it for re-review ahead of the next audit. The retrieval job on these 12 questions is to map the buyer’s wording to the canonical block and emit the cited answer.

That is not a hard retrieval job. It is an embedding lookup plus a reranker plus a verification pass. The embedding lookup surfaces 20 candidate blocks per question. The reranker picks the top three. The verifier checks that the candidate block entails the question’s answer, and if it does, the answer is drafted with citation attached. We have written about the verification step and the reranker in more depth before.

Retrieval precision on security-questionnaire queries is higher than on general RFP queries on our internal eval — precision@1 sits in the low-to-mid 0.8s across all RFP-domain queries; on the security-questionnaire sub-gold-set, it runs a few points higher. The difference is that security questions are more templated. Fewer novel phrasings per question. Less reliance on buyer-specific jargon. The gap is directional on our corpora; it would not necessarily reproduce on an external setup.

The 20% that doesn’t

The tail is where the questionnaire becomes interesting. Six kinds of questions tend to end up here:

1. Buyer-specific contractual asks. “Will you agree to a 72-hour breach notification window in our MSA?” That is a legal question, not a security question. It needs the legal team. The KB might have a reference answer — “we default to 96 hours, negotiable to 72 where required by regulation” — but the actual answer depends on the deal.

2. Novel control questions. “Describe your use of large-language-model providers and how customer data is excluded from training.” Three years ago, that question did not exist in any framework. The first time a security team sees it, nothing in the KB covers it. Someone has to write a canonical answer, add it to the KB, and make sure it is cited against the customer’s data-handling agreements. From the second time forward, it is part of the 80%.

3. Architectural questions tied to specific customer deployments. “What is the latency impact of your encryption layer on our specific workload?” The KB has a general answer. The specific answer requires a solutions engineer to run a load test.

4. Questions about the customer’s own environment. “Can your product be deployed in an air-gapped environment with no outbound network access?” The KB has a general answer about on-prem support. The specific answer depends on which integrations the customer needs.

5. Attestation questions where the evidence has expired. “Attach your most recent SOC 2 Type II report.” The KB has a reference to the attestation; the attached PDF may be 14 months old. A competent workflow surfaces the expiration, triggers a refresh request to the audit firm, and blocks the auto-answer until the new PDF is attached.

6. Questions the buyer wrote specifically to trip up vendors. Some questionnaires include intentionally tricky questions — “What is the average time-to-detection for insider threats in the last 12 months, broken down by severity tier?” — that are designed to separate vendors with real data from vendors with generic answers. These escalate to a security engineer who has access to the metrics.

The escalation path matters as much as the retrieval path. A security-questionnaire workflow that can’t escalate cleanly will either ship a fabricated answer on one of the tail questions or block on an SME who doesn’t know they are on the critical path. Our ticketing layer (shipped in Q2) was built specifically for this. The questionnaire workflow has a 48-hour SLA on SME tickets, and the SLA clock starts when the retrieval layer flags the question as escalation-bound.
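As a sketch of how that clock might be wired, here is one way the ticket could look; the type and function names below are illustrative, not the product's actual schema:

// A minimal sketch of the escalation ticket and its SLA clock, assuming
// a 48-hour window that starts when the retrieval layer flags the question.
type EscalationReason = "stale_evidence" | "low_confidence" | "novel_pattern";

interface SmeTicket {
  questionnaireId: string;
  question: string;
  reason: EscalationReason;
  ownerId: string;   // block owner to notify, from KB metadata
  flaggedAt: Date;   // SLA clock starts here, not at human triage
  slaDeadline: Date;
}

const SLA_HOURS = 48;

function openSmeTicket(
  questionnaireId: string,
  question: string,
  reason: EscalationReason,
  ownerId: string,
): SmeTicket {
  const flaggedAt = new Date();
  // Simplification: calendar hours. A production clock would count
  // business hours against the SME's working calendar.
  const slaDeadline = new Date(flaggedAt.getTime() + SLA_HOURS * 60 * 60 * 1000);
  return { questionnaireId, question, reason, ownerId, flaggedAt, slaDeadline };
}

Starting the clock at flag time rather than at human triage is the point: the buyer's timeline does not pause while the ticket waits in a queue.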

The retrieval layer

A security KB block is not the same shape as a marketing or RFP block. Four differences matter.

Owner. A marketing block is owned by marketing. A past-performance block is owned by the capture lead on the original deal. A security block is owned by the security team, usually a single GRC lead, and the block’s owner is encoded in the metadata. When the block is cited, the workflow knows who to notify if the buyer pushes back on the answer.

Versioning cadence. Marketing blocks change when marketing changes. Security blocks change when the underlying control changes — which is usually once per audit cycle, annually or semi-annually. A security block carries a version that maps to an attestation date. The version is pinned. The block text cannot silently drift without triggering a re-review.

Evidence attachments. A security block is not just text. It has attachments — SOC 2 reports, pentest reports, SIG Lite submissions, ISO certificates — that need to travel with the answer when it’s cited. The attachment system we shipped last week handles this: when the retrieval layer picks a block with an attached evidence PDF, the PDF is automatically staged for the response package.

Freshness scoring. Marketing blocks age gracefully. Security blocks age badly. A block that cites last year’s pentest report is actively wrong the day the new pentest completes. The freshness score (shipped Q2) compares the block’s last-updated timestamp to the attestation cycle, and if the block is within 30 days of expiry, the retrieval layer flags it as stale and blocks the auto-answer.
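Put together, a security block looks less like a snippet of text and more like a small record. A minimal sketch of the shape, assuming field names along the lines of what the retrieval example below expects:

// Illustrative only: one way a security KB block could be modeled,
// reflecting the four differences above. Not the product schema.
interface SecurityKbBlock {
  id: string;                 // e.g. "control/encryption-at-rest"
  pillar: "security";
  status: "published" | "draft" | "retired";
  text: string;               // the canonical answer, written once
  ownerId: string;            // GRC lead or security owner to notify
  version: string;            // pinned to an attestation cycle
  attestationDate: string;    // ISO date of the cycle the text reflects
  lastUpdated: string;        // drives the freshness score
  freshnessScore: number;     // 0..1, flags re-review near expiry
  evidenceRefs: string[];     // SOC 2 report, pentest PDF, ISO cert, ...
}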

Here is the shape of a security-question retrieval call inside the product, simplified:

async function answerSecurityQuestion(
  question: string,
  customerId: string,
  questionnaireId: string,
): Promise<AnswerResult> {
  // 1. Embed and retrieve top-k candidate blocks.
  const candidates = await kb.retrieve({
    query: question,
    customerId,
    filter: { pillar: "security", status: "published" },
    topK: 20,
  });

  // 2. Rerank against the specific question wording.
  const ranked = await reranker.score(question, candidates);
  const top3 = ranked.slice(0, 3);

  // 3. Freshness check. A stale block can't auto-answer.
  const fresh = top3.filter((b) => b.freshnessScore >= FRESHNESS_FLOOR);
  if (fresh.length === 0) {
    return escalate(question, "stale_evidence", top3[0]?.ownerId);
  }

  // 4. Entailment verification. Does the top block actually
  //    answer the question?
  const best = fresh[0];
  const entails = await verifier.entails(best.text, question);
  if (!entails || best.confidence < CONFIDENCE_FLOOR) {
    return escalate(question, "low_confidence", best.ownerId);
  }

  // 5. Draft with citation, attach evidence.
  const draft = await drafter.draftAnswer(question, best);
  const evidence = await attachments.stage(best.evidenceRefs);

  return {
    kind: "auto_answered",
    answer: draft.text,
    citation: { blockId: best.id, version: best.version },
    evidence,
    confidence: best.confidence,
  };
}

Five steps. The first two are standard retrieval. The third is the freshness guard — specific to security blocks. The fourth is the entailment check — the invariant that we covered in the grounded-retrieval pillar. The fifth produces the auto-answered draft or escalates to the block’s owner. Nothing clever. The cleverness is in the KB schema underneath.

The CONFIDENCE_FLOOR is the tunable parameter we wrote about separately. We set it high for security questions. The cost of a wrong auto-answer is larger than the cost of an unnecessary escalation. The floor we use for security questionnaires is meaningfully higher than the floor we use for generic RFP questions — by roughly 15 points on our verifier scale. That gap reflects the difference in review discipline between a security-questionnaire workflow (tight cycle, limited review) and an RFP workflow (pink, red, gold review passes).
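As an illustration only, the relationship between the two floors might look like this; the values are placeholders on a 0-to-1 verifier scale, not our production settings:

// Hypothetical per-domain floors. The exact numbers are placeholders;
// the point is that the security floor sits well above the RFP floor.
const CONFIDENCE_FLOORS = {
  security_questionnaire: 0.9, // tight cycle, limited downstream review
  generic_rfp: 0.75,           // pink/red/gold passes catch more downstream
} as const;

const CONFIDENCE_FLOOR = CONFIDENCE_FLOORS.security_questionnaire;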

Verification, not generation

This is the part that differentiates a grounded security-questionnaire workflow from a generic LLM one. LLMs are good at drafting answers that look right. They are bad at verifying that the drafted answer actually matches the cited source. The Stanford HAI study of AI legal research tools put hallucination rates between 17% and 33% even with retrieval and citations wired in. AutogenAI’s own engineering team has written about the limits of LLM-native proposal generation when verification isn’t built in.

For security questionnaires, that rate would destroy trust. A security-questionnaire answer needs to be verifiable at the clause level. Every substantive claim — every number, every control, every commitment — needs to entail from a specific KB block that the security team has blessed.

Numeric facts are the most sensitive. A question about RTO (recovery time objective) has a specific number attached — “four hours” — and that number is either right or wrong. It cannot be paraphrased. A drafter that generates “our RTO is approximately four hours under normal conditions” when the KB says “four hours” has added a hedge word that the SOC 2 report doesn’t support. The verifier catches this because the verifier is doing exact-match on numeric claims, not embedding similarity.

We decompose every emitted sentence into atomic claims and verify them independently. For the question “What is your RTO?” the answer “Our RTO is four hours.” decomposes into one claim (RTO = 4 hours) and is verified against the KB block’s numeric field. For the question “Describe your encryption posture.” the answer is a paragraph with five or six claims (AES-256, KMS, 90-day rotation, TLS 1.3, ciphers listed) and each one is verified independently. A sentence that passes four of five claim checks does not ship — the verifier blocks it and escalates.
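A minimal sketch of that claim-level loop, assuming the claims have already been decomposed and that numeric claims carry a pointer to the block's structured field; all names here are illustrative:

// Sketch of claim-level verification under the policy above: numeric
// claims are exact-matched against the block's structured fields,
// everything else goes through the entailment verifier, and a single
// failed claim blocks the whole sentence.
interface AtomicClaim {
  text: string;              // e.g. "RTO is four hours"
  numericValue?: number;     // present when the claim asserts a number
  field?: string;            // KB field the number must match, e.g. "rto_hours"
}

async function verifySentence(
  claims: AtomicClaim[],
  block: { numericFields: Record<string, number>; text: string },
  verifier: { entails(source: string, claim: string): Promise<boolean> },
): Promise<"pass" | "escalate"> {
  for (const claim of claims) {
    if (claim.numericValue !== undefined && claim.field) {
      // Numeric facts: exact match, never embedding similarity.
      if (block.numericFields[claim.field] !== claim.numericValue) {
        return "escalate";
      }
    } else if (!(await verifier.entails(block.text, claim.text))) {
      // Non-numeric claims must entail from the cited block.
      return "escalate";
    }
  }
  // Every claim passed, or the sentence does not ship. Four of five fails.
  return "pass";
}

The asymmetry is deliberate: a single failed claim escalates the whole sentence rather than shipping a partially verified answer.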

This is more expensive than just drafting. It is also the only way to meaningfully close the gap from Stanford’s 17-33% hallucination range toward near-ceiling. We track claim-level entailment rate on the security-questionnaire sub-gold-set separately from the general rate on our internal eval. The security sub-rate runs a few points higher than the general rate; the gap is the higher confidence floor doing work. Both numbers are directional and specific to our corpora, not a general benchmark.

The rest of the gap — the last 3% — is where the 20% tail lives. Those are the questions the verifier correctly refuses to auto-answer. The right response to a refusal is not to lower the floor until the number hits 1.0. The right response is to escalate.

Where the 20% breaks

When the verifier refuses to auto-answer a question, the workflow routes the question to the block’s owner. That sounds simple. In practice it is the most operationally delicate part of the whole machine.

The SLA is 48 hours. That is not a PursuitAgent policy — it is the typical window a regulated-vendor sales team has between receiving a questionnaire and the buyer’s “please confirm response timeline” nudge. If the SME doesn’t answer in 48 hours, the response ships late or ships incomplete, both of which cost deals.

The ticket carries a tight context package: the original question, the closest retrieved blocks (even though they didn’t meet the confidence floor), the reason for the refusal (stale evidence, low confidence, novel pattern), and a suggested answer structure. The goal is to minimize the SME’s time per ticket. A well-structured ticket takes five minutes to resolve; a badly structured one takes 45.
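A sketch of what that context package might look like as a type, assuming the field names below (they are illustrative, not the product schema):

// One possible shape for the SME ticket's context package.
interface TicketContext {
  question: string;                  // the buyer's original wording
  nearMisses: {                      // closest blocks that missed the floor
    blockId: string;
    version: string;
    confidence: number;
  }[];
  refusalReason: "stale_evidence" | "low_confidence" | "novel_pattern";
  suggestedStructure: string[];      // e.g. ["scope", "control", "evidence ref"]
}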

Evidence attachments are the other failure mode. A question like “Attach your most recent SOC 2 Type II report and your most recent penetration test summary” requires two PDFs. The retrieval layer finds the references; the attachment layer stages them; but if the PDFs are stored in a disorganized shared drive, the staging fails and someone has to hunt. The evidence vault architecture we shipped to handle this is what makes the auto-attachment reliable. Without a vault, you are rebuilding the attachment list every questionnaire.

The third failure mode is the one that is hardest to automate around: buyers who wrote custom questions about your specific architecture. “How does your product handle the case where our customers’ end users need to access the system from inside a specific VPC peering arrangement?” That is not in the KB. It cannot be in the KB, because the KB doesn’t know about the buyer’s VPC topology. Someone has to read the buyer’s architecture diagram and write a custom answer. The workflow’s job is to detect these early — usually by flagging questions with proper nouns that don’t appear anywhere in the KB — and route them to the solutions engineer fast.
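The heuristic itself is simple. A rough sketch, assuming the KB exposes a lowercased index of every term it contains; the function name and shape are assumptions:

// Flag questions containing capitalized terms that never appear in the KB,
// since those usually describe the buyer's own environment.
function looksBuyerSpecific(question: string, kbTermIndex: Set<string>): boolean {
  const tokens = question.split(/\s+/);
  const properNouns = tokens.filter((tok, i) => {
    const clean = tok.replace(/[^A-Za-z0-9-]/g, "");
    // Skip sentence-initial capitals; keep mid-sentence capitalized terms.
    return i > 0 && /^[A-Z]/.test(clean) && clean.length > 1;
  });
  return properNouns.some((noun) => !kbTermIndex.has(noun.toLowerCase()));
}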

We have a long way to go on that last detection step. The current heuristic catches about 70% of the architectural questions. The other 30% slip through and end up in the auto-answer queue, get refused for low confidence, and then escalate — which is fine, but it adds 12 hours to the turnaround. Getting that detection rate up is on the Q1 roadmap.

The cost of the wrong split

A thought experiment. A vendor receives a 300-question security questionnaire. They have three options.

Option one: auto-answer everything. They configure their tool to draft all 300, no escalation. They ship in four hours. Of the 300 answers, 240 are correct, 30 are plausible but subtly wrong (the tail the verifier should have caught), and 30 are wrong enough that a careful reviewer on the buyer side will notice. The questionnaire comes back with clarifications, the deal slows, and the vendor loses a week to cleanup.

Option two: send everything to SMEs. They send all 300 to the security team. The team spends 40 hours. The questionnaire ships on time with high quality. The vendor’s security engineers, who should be doing security work, spent a week doing retrieval. Multiply by 400 questionnaires a year and that is 16,000 hours, roughly eight full-time equivalents funneled into questionnaire response.

Option three: the 80/20. Auto-answer the 240 retrieval-bound questions with verification. Route the 60 tail questions to SMEs with a 48-hour SLA. Total SME time: roughly six hours. Total analyst time: four to eight hours. The questionnaire ships on time with high quality and the security team gets their week back.

The three options are not equally costly. Option one saves labor by spending accuracy. Option two saves accuracy by spending labor. Option three is the only one that respects both constraints. The split is not just a tooling preference; it is the difference between a sustainable questionnaire function and one that degrades over a year.

What we measure per questionnaire

Four numbers we track for every shipped questionnaire:

  1. Auto-answer rate. Percentage of questions handled by the retrieval+verifier path. Healthy range: 70% to 85% for security questionnaires. Below 70% means the KB is under-built or the confidence floor is set too high. Above 85% is suspicious — likely either rubber-stamping or an unusually template-heavy questionnaire.
  2. SME SLA attainment. Percentage of SME tickets resolved within 48 business hours. Target: 90%. We publish this to customers in their monthly workflow report so they can intervene when attainment drifts.
  3. Post-ship correction rate. Percentage of shipped answers that had to be corrected after the questionnaire was delivered, either because the buyer caught something or because the analyst caught it on a re-read. Target: below 2%. Higher means the quality pass is too thin.
  4. Time from receipt to ship. Business hours. Median target for security questionnaires is three business days; the p90 is five. Exceeding either means the team is overloaded or the KB is drifting.

The four numbers are correlated. An under-built KB shows up first as low auto-answer rate, then as SLA pressure on SMEs, then as quality-pass slippage, then as post-ship corrections. By the time corrections are visible, the team has been underwater for weeks. Watching the first number is how we catch the pattern early.
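For concreteness, a minimal sketch of how the four numbers could be computed per questionnaire, assuming each answered question records its path and post-ship status; field names are illustrative and the thresholds in comments mirror the targets above:

// Sketch of the per-questionnaire rollup, not the product's reporting code.
interface AnsweredQuestion {
  path: "auto" | "sme";
  smeResolvedWithinSla?: boolean;   // only meaningful on the SME path
  correctedAfterShip: boolean;
}

function questionnaireMetrics(
  questions: AnsweredQuestion[],
  receiptToShipBusinessHours: number,
) {
  const total = questions.length;
  const auto = questions.filter((q) => q.path === "auto").length;
  const smeTickets = questions.filter((q) => q.path === "sme");
  const withinSla = smeTickets.filter((q) => q.smeResolvedWithinSla).length;
  const corrected = questions.filter((q) => q.correctedAfterShip).length;

  return {
    autoAnswerRate: auto / total,                          // healthy: 0.70-0.85
    smeSlaAttainment: smeTickets.length
      ? withinSla / smeTickets.length                      // target: >= 0.90
      : 1,
    postShipCorrectionRate: corrected / total,             // target: < 0.02
    receiptToShipBusinessHours,                            // median target: ~3 business days
  };
}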

Closing

The split is the important thing. A security questionnaire is mostly retrieval. A small and important tail is not. A tool that treats every question as a draft job will hallucinate on the 80% and miss the 20%. A tool that treats every question as an SME job will ship late.

The right architecture is: retrieve and verify the 80%, escalate the 20% with a tight context package, and keep the KB blocks fresh so the 80% doesn’t silently decay. If you want to go deeper on the adjacent mechanics, the DDQ response playbook covers the end-to-end operational flow, the grounded-retrieval pillar covers the verification discipline, the security-questionnaire ingest post covers how we build security blocks from source PDFs, and the DDQ anatomy part 3 covers the security questionnaire itself as an artifact.

The part we keep coming back to: the 80% is boring work. Boring work is the kind that repays automation if — and only if — the automation refuses to be clever on the part it can’t verify.

Sources

  1. Safe Security — The state of enterprise security questionnaires (2024)
  2. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)
  3. Loopio — How long does it take to respond to a DDQ?
  4. AutogenAI — Hallucinations and the limits of LLM proposal generation
  5. AICPA — SOC 2 trust services criteria
  6. Cloud Security Alliance — Consensus Assessments Initiative Questionnaire (CAIQ)
  7. Shared Assessments — Standardized Information Gathering (SIG) Questionnaire

See grounded retrieval in the product.

Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.