Field notes

Why we don't do autonomous proposal agents yet

An opinion piece. What an agentic drafting system would have to guarantee that retrieval doesn't, why we don't think the category is ready, and the work we'd want to see before changing our position.

Bo Bergstrom · 6 min read

This is an opinion piece. The opinion: PursuitAgent is not building an autonomous proposal agent right now, and I don’t expect us to in 2025.

The framing is everywhere. “Agent writes the bid for you.” “Agent reads the RFP, builds the compliance matrix, drafts the response, and submits.” Some of this is real engineering work. A lot of it is the natural extrapolation of “look how much an LLM can do in a single prompt — imagine what it can do in a loop.” The pattern is well established in adjacent categories. It’s also where the math gets uncomfortable for proposal work specifically, and I want to write down why we are sitting it out for now.

The retrieval gate is hard enough

The discipline of grounded retrieval — pointer, provenance, entailment, refusal — is hard work that took us a year to get to a number we are willing to publish (currently 0.94 claim-level entailment, with a ceiling we are pushing on). That number is for a single draft step: one question, one source block, one sentence, verified. The gate runs once per generated sentence and refuses if entailment doesn’t hold.
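
To make that concrete, here is a minimal sketch of the single-step check. The names (SourceBlock, the score callable, the 0.9 threshold) are illustrative stand-ins, not the production gate; the point is the order of operations: pointer, provenance, entailment, and refusal as the default when entailment does not hold.

```python
# Illustrative sketch of a single-step grounding gate. SourceBlock, the
# `score` callable, and the 0.9 threshold are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SourceBlock:
    doc_id: str              # provenance: which document the block came from
    span: tuple[int, int]    # pointer: character offsets inside that document
    text: str

def gate(sentence: str,
         source: SourceBlock,
         score: Callable[[str, str], float],
         threshold: float = 0.9) -> dict:
    """One question, one source block, one sentence: refuse rather than
    emit a sentence the block does not entail."""
    entailment = score(sentence, source.text)
    if entailment < threshold:
        return {"status": "refused", "entailment": entailment}
    return {"status": "accepted", "entailment": entailment,
            "provenance": source.doc_id, "pointer": source.span}
```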

An autonomous agent loops over many of these steps without a human in between. Each loop step has to satisfy the same invariants. If a single-step grounded draft has a 0.94 claim-level entailment rate, a five-step agent that needs every step to be correct has a compounding-failure surface. The math is not literally 0.94^5 because the steps aren’t independent, but the qualitative shape — agent reliability degrading multiplicatively across an unsupervised loop — is the right shape to worry about.
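
A back-of-the-envelope illustration of that shape, with the same caveat about independence:

```python
# Back-of-the-envelope only: per-step reliability compounded over an
# unsupervised loop, under the (wrong in detail) independence assumption.
per_step = 0.94
for steps in (1, 2, 3, 5, 10):
    print(steps, round(per_step ** steps, 3))
# 1: 0.94 | 2: 0.884 | 3: 0.831 | 5: 0.734 | 10: 0.539
```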

The Stanford HAI paper’s 17–33% hallucination rates on grounded legal RAG systems are a single-step measurement. The Hacker News thread debating whether RAG can solve hallucinations is also single-step. Multi-step agentic systems with autonomous decision-making are not better than the single-step floor. They tend to be worse, because the loop accumulates errors that no human has the chance to correct.

What an agentic system would have to guarantee

Three things, none of which the category has shown me publicly yet.

Per-step entailment with the same rigor as single-step. Every step the agent takes that produces text — drafting a section, summarizing a source, selecting a win theme — has to pass the same gate as a single-step draft. Pointer, provenance, entailment, refusal. The agent does not get a “trust me, I’m working on it” pass between steps.
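
As a sketch, this first requirement is a loop with the gate inside every iteration. The plan, retrieve, and gate callables below are stand-ins (a gate like the one sketched earlier, already bound to an entailment scorer), not a real agent:

```python
# Sketch only. `plan`, `retrieve`, and `gate` are stand-ins; the point is
# that every text-producing step passes the same gate, and a refusal stops
# that step rather than letting the loop improvise around it.
def run_gated_loop(rfp_sections, plan, retrieve, gate):
    accepted, refusals = [], []
    for step_id, section in enumerate(plan(rfp_sections)):
        sentence, source = retrieve(section)   # one question, one source block
        result = gate(sentence, source)        # pointer, provenance, entailment
        if result["status"] != "accepted":
            refusals.append((step_id, section, result))
            continue                           # no "trust me" pass between steps
        accepted.append((step_id, sentence, result))
    return accepted, refusals
```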

Inter-step accountability. When the agent fails — and it will — the failure has to be locatable. Step 3 generated the bad sentence; the bad sentence influenced step 5; step 7 included the influenced sentence in the final output. The reviewer needs to be able to walk the chain and find the failure. Most agent demos I have seen do not produce this trace; they produce a final output and a “thinking” log that is not the same thing as a verifiable causal chain.
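
A minimal sketch of what a walkable trace could look like. The field names are illustrative, not a standard; the test is whether a reviewer can start from a bad sentence and reach every step it influenced.

```python
# Illustrative trace record, not a standard.
from dataclasses import dataclass

@dataclass
class StepTrace:
    step_id: int
    inputs: list[int]     # step_ids whose outputs this step conditioned on
    sources: list[str]    # provenance of any retrieved blocks
    output: str           # the text or decision this step produced
    entailment: float     # gate score for that output

def downstream_of(traces: list[StepTrace], bad_step: int) -> list[int]:
    """Every step that conditioned, directly or indirectly, on a bad step."""
    tainted = {bad_step}
    for t in sorted(traces, key=lambda t: t.step_id):
        if tainted & set(t.inputs):
            tainted.add(t.step_id)
    return sorted(tainted - {bad_step})
```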

A refusal posture stronger than the underlying single-step refusal. A multi-step system has to be more conservative than its components, not less. Composition introduces failure modes that don’t exist at the single-step layer (drift across steps, conditioning on bad earlier outputs, planning failures). The refusal threshold for the agent has to be set above the threshold for any single step. The current category default is the opposite: agents are pitched as “more powerful,” which usually translates in practice to “less likely to refuse.” That is the wrong direction for proposal work.
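
One way to see why the bar has to rise, with the same rough independence caveat as before: holding the end-to-end floor fixed forces the per-step floor above the single-step threshold.

```python
# Illustrative arithmetic only: to keep a multi-step chain's floor at the
# single-step number, each step has to clear a higher bar.
end_to_end_floor = 0.94
for steps in (1, 3, 5, 10):
    print(steps, round(end_to_end_floor ** (1 / steps), 4))
# 1: 0.94 | 3: 0.9796 | 5: 0.9877 | 10: 0.9938
```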

AutogenAI’s own blog post on hallucination names the failure modes specifically — invented case studies, fabricated statistics, incorrect compliance claims — and recommends operator-side mitigations. An autonomous agent is, by construction, the system that removes the operator from the mitigation loop. The recommendation and the architectural pitch are in tension.

Where I might change my mind

I am not committed to never building this. Three things would change my read.

Per-claim verification at agent-step granularity, with a published gold-set entailment rate at or above the single-step number. If we can show that a multi-step agent maintains or exceeds the single-step claim-level entailment rate on a held-out gold set, the math gets less uncomfortable. The Hacker News thread on Mayo Clinic’s reverse-RAG pattern is the closest published exploration of how this might work; it is not yet in production at scale.

A published agent-trace contract. A standard for how an autonomous system records its decisions such that a reviewer can walk the chain. We don’t have one. The community is sketching toward it — there are proposals for “agent observability” in the open-source RAG ecosystem — but nothing has stabilized. When something does, we will look at it seriously.

A category use case where agentic loops produce a quality lift retrieval-with-human-in-the-loop can’t match. The strongest argument for agents in any category is that the autonomous loop catches things a single-prompt system misses. I have not yet seen an example in proposal work where the agentic system found a win theme, a compliance gap, or a draft improvement that a well-designed retrieval system with reviewer prompts couldn’t surface. Show me that example and the conversation changes.

What we are doing instead

Two things, both visible in the product roadmap.

Multi-block entailment, in flight. The research branch I described in the grounded-retrieval pillar — composing entailment across two source blocks for sentences that draw from both — is the closest thing we are building to an “agentic” capability. It is bounded. It has a defined verification step. It does not loop without a human review at the boundary.
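
A sketch of the bounded shape, not a description of the actual branch: decompose the sentence into claims and require each claim to be entailed by at least one of the two blocks, with refusal as the default.

```python
# Sketch under assumptions: `split_claims` and `score` are stand-ins, and the
# composition rule shown (every claim entailed by at least one block, refuse
# otherwise) is one conservative option, not the branch's actual design.
def two_block_gate(sentence, block_a, block_b, split_claims, score,
                   threshold=0.9):
    claims = split_claims(sentence)
    per_claim = [max(score(c, block_a), score(c, block_b)) for c in claims]
    if not per_claim or min(per_claim) < threshold:
        return {"status": "refused", "claim_scores": per_claim}
    return {"status": "accepted", "claim_scores": per_claim}
```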

Reviewer-assist, not reviewer-replacement. Every product surface we ship in the next two quarters is a tool for the human reviewer to find and fix things faster. Citation-verify buttons. Compliance-gap surfacing. Win-theme consistency checks across sections. None of these remove the reviewer from the loop. All of them make the reviewer’s job 10x faster on the part of the job that is mechanical.

This is a deliberate technology bet. The category will go agentic eventually. The vendors that go agentic too early will ship things that cite badly, fabricate confidently, and lose the buyer’s trust on the first audit. We would rather be the company that ships the un-flashy thing that works than the company that ships the agentic demo and patches the trust failures afterward.

The honest hedge

If, by mid-2026, the math on agent reliability has moved meaningfully — published gold-set entailment rates on multi-step systems at or above 0.94, or a credible verification-trace standard adopted across the open-source ecosystem — I expect us to revisit. I am not making a “we will never” claim. I am making a “the current evidence does not support shipping this” claim. Those are different. The first one is identity; the second is a position you are accountable to update when the evidence changes.

For now: retrieval is the discipline. Agents are the marketing word. We are doing the discipline.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
  2. Hacker News — The issue of hallucinations won't be solved with the RAG approach
  3. Hacker News — Mayo Clinic's secret weapon against AI hallucinations: Reverse RAG
  4. AutogenAI — AI hallucination: how can proposal teams reduce risk?
  5. PursuitAgent — Grounded retrieval pillar
  6. PursuitAgent — Grounded-AI Pledge in code