Field notes

The cost per response, broken down to the penny

Embedding calls, retrieval compute, draft tokens, verifier tokens, storage. The unit cost structure of a single drafted RFP answer, with a worked example. We publish the unit economics, not customer costs.

This is the post we have been writing toward for a couple of weeks. The cost per drafted response, broken down by line item, in dollars and cents.

A note up front: we do not publish customer-specific costs. We publish the unit economics that produce customer costs. A customer’s actual run cost depends on their KB size, their question volume, the proportion of questions that resolve via dedupe vs. full retrieval, and a handful of other variables. The unit costs below are the per-call costs at our current pricing posture; the customer cost is what those unit costs add up to for a given workload.

A second caveat: model and inference prices change, sometimes monthly. The numbers below are accurate at the time of writing. The structure is durable. The specific cents will move as Anthropic, OpenAI, and Google update their pricing — see Anthropic’s pricing page and OpenAI’s for current rates.

The pipeline, by stage

A single drafted RFP answer goes through five chargeable stages.

  1. Embedding the question for retrieval.
  2. Retrieving candidate blocks from the vector store.
  3. Drafting the answer (the rewrite-only LLM call).
  4. Verifying the draft (entailment-checking against the source).
  5. Emitting the result (storage and metadata write).

Plus the prerequisite costs: the corpus had to be ingested (extraction, chunking, embedding) at some prior point, and the storage that holds the corpus and the captured pair has an ongoing cost.

We will walk through each one.
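
For readers who want to follow the arithmetic in code, here is a minimal scaffold for the walk-through below. The class and field names are ours for illustration only, not part of any product API; the figures get filled in stage by stage.

```python
from dataclasses import dataclass

@dataclass
class ResponseCost:
    """Per-question compute cost in dollars, one field per chargeable stage."""
    embed: float = 0.0     # Stage 1: embedding the question
    retrieve: float = 0.0  # Stage 2: vector-store query
    draft: float = 0.0     # Stage 3: rewrite-only LLM call
    verify: float = 0.0    # Stage 4: entailment check
    emit: float = 0.0      # Stage 5: storage and metadata write

    def total(self) -> float:
        return self.embed + self.retrieve + self.draft + self.verify + self.emit
```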

Stage 1 — Embedding the question

The question is embedded once per draft. We use a small embedding model (OpenAI text-embedding-3-small at the time of writing, with a self-hosted embedding model as a backstop for KB blocks). The cost per question:

  • Input: ~50 tokens for a typical RFP question, ~150 tokens for a question that has gone through query rewriting (which produces a longer, contextualized query).
  • Embedding cost: at current pricing, well under one tenth of a cent per question.

For a 100-question response, embedding is a rounding error in the run cost — under a dime total.
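
As a sketch of that arithmetic, assuming the published text-embedding-3-small rate of $0.02 per million tokens (current at the time of writing; check OpenAI’s pricing page for today’s number):

```python
EMBED_RATE = 0.02 / 1_000_000  # $/token, text-embedding-3-small (rate at time of writing)

def embed_cost(question_tokens: int) -> float:
    """Cost of embedding one question for retrieval."""
    return question_tokens * EMBED_RATE

# Even at the longer, rewritten-query length, 100 questions cost ~$0.0003:
print(100 * embed_cost(150))  # ~0.0003 dollars -- a rounding error
```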

Stage 2 — Retrieving candidate blocks

Retrieval is a vector-similarity query against the customer’s KB index. The compute cost depends on the index size and the query type.

For a hybrid retrieval (dense vectors + sparse BM25) over a typical mid-sized KB (50,000 to 200,000 chunks), a single query runs in tens of milliseconds and the marginal compute cost is small. Database operations dominate; LLM costs do not enter this stage.

We allocate roughly $0.001 to $0.003 per retrieval at our current infrastructure cost, depending on KB size. For a 100-question response, retrieval costs add to roughly 10 to 30 cents.
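
Unlike the LLM stages, this number is an infrastructure allocation rather than a metered API rate. A minimal sketch of how a figure in that band falls out, with hypothetical monthly numbers for illustration:

```python
def per_retrieval_cost(monthly_infra_dollars: float, monthly_queries: int) -> float:
    """Amortize fixed vector-store infrastructure cost over query volume."""
    return monthly_infra_dollars / monthly_queries

# Hypothetical: a $600/month index serving 300,000 hybrid queries
# lands at $0.002 per retrieval, inside the band quoted above.
print(per_retrieval_cost(600, 300_000))  # 0.002
```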

Stage 3 — Drafting

This is the dominant cost. Drafting is a Claude-family LLM call on every question that clears the retrieval floor. The cost depends on model choice and token count.

A typical drafting call:

  • System prompt (the rewrite-only constraint): ~250 tokens.
  • Question: ~50 tokens.
  • Source block: ~600 to 1,200 tokens (we typically pass the top one or two blocks).
  • Output (drafted answer): ~150 to 400 tokens.

Total input: roughly 1,000 to 1,500 tokens. Total output: roughly 200 to 400 tokens.

At current Anthropic Claude Sonnet 4.6 pricing — input around $3 per million tokens, output around $15 per million — the per-question drafting cost works out to:

  • Input cost: ~$0.003 to $0.0045
  • Output cost: ~$0.003 to $0.006
  • Combined: ~$0.006 to $0.0105 per drafted question, or roughly six tenths of a cent to just over a cent.

For a 100-question response where 50 questions reach drafting (the rest resolved via dedupe, refused at the retrieval floor, or out-of-scope), drafting costs add to roughly 30 to 55 cents.
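
The same numbers as a function, with the Sonnet rates above as assumed constants (they will drift as pricing changes):

```python
SONNET_IN = 3.00 / 1_000_000    # $/input token (rate at time of writing)
SONNET_OUT = 15.00 / 1_000_000  # $/output token (rate at time of writing)

def draft_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-question cost of the rewrite-only drafting call."""
    return input_tokens * SONNET_IN + output_tokens * SONNET_OUT

print(draft_cost(1_000, 200))       # 0.006  -- low end of the range
print(draft_cost(1_500, 400))       # 0.0105 -- high end
print(50 * draft_cost(1_250, 300))  # ~0.41  -- 50 drafted questions, mid-range tokens
```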

We do route some longer-form drafts to a more capable model when the response section is multi-paragraph and the source is multi-block. Those calls are a few times more expensive each but represent a small fraction of total volume.

Stage 4 — Verification

Every emitted draft goes through entailment verification. The verifier is a smaller model — we use a Haiku-class model — tuned on entailment data.

A typical verification call:

  • System prompt: ~150 tokens.
  • Drafted sentence: ~150 tokens.
  • Source block: ~600 tokens.
  • Output (entailment label, optional reasoning): ~50 tokens.

Total: ~950 input tokens, 50 output. At Haiku pricing (input around $0.80 per million, output around $4 per million):

  • Input cost: ~$0.00076
  • Output cost: ~$0.0002
  • Combined: ~$0.001 per verification, or about a tenth of a cent.

Verification typically runs once per drafted answer. For a 100-question response with 50 drafts, verification adds roughly 5 cents.

(Some sentences run through verification twice — once for the initial draft, once after a regeneration when the first verification failed. We log the regeneration rate per account; it is typically under 10% of drafts.)
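
Folding the regeneration rate into an expected cost per draft, as a sketch (the 10% figure is the typical upper bound quoted above, and each regeneration triggers one extra verification pass):

```python
HAIKU_IN = 0.80 / 1_000_000  # $/input token (rate at time of writing)
HAIKU_OUT = 4.00 / 1_000_000  # $/output token (rate at time of writing)

def verify_cost(input_tokens: int = 950, output_tokens: int = 50) -> float:
    """One entailment-verification pass against the source block."""
    return input_tokens * HAIKU_IN + output_tokens * HAIKU_OUT

def expected_verify_cost(regen_rate: float = 0.10) -> float:
    """Expected passes per draft: one, plus one more when regeneration fires."""
    return verify_cost() * (1 + regen_rate)

print(verify_cost())                # ~0.00096 -- about a tenth of a cent
print(50 * expected_verify_cost())  # ~0.053 for 50 drafts at a 10% regen rate
```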

Stage 5 — Emit

Storage and metadata write per emitted answer. Roughly $0.0001 per answer. For 100 questions, about a cent total.

The full picture, on a 100-question response

Putting it together for a representative 100-question response with the typical mix:

  • 40 questions resolved by dedupe (no drafting, just embedding + retrieval against prior questionnaires + storage): ~$0.04 total.
  • 50 questions resolved by full retrieval pipeline (embedding + retrieval + draft + verify + emit): ~$0.40 total.
  • 10 questions refused (embedding + retrieval + early-stage exit): ~$0.02 total.

Total compute cost for the response: roughly 50 cents.
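
The rollup, reproduced in code. The three per-question unit costs are the approximate figures implied by the line items above, not metered values:

```python
MIX = {
    # category: (questions, ~dollars per question)
    "dedupe":  (40, 0.001),  # embed + retrieve against prior questionnaires + store
    "drafted": (50, 0.008),  # full pipeline: embed + retrieve + draft + verify + emit
    "refused": (10, 0.002),  # embed + retrieve + early exit
}

total = sum(n * unit for n, unit in MIX.values())
print(f"${total:.2f}")  # $0.46 -- roughly 50 cents
```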

This is the unit cost. The price we charge per response is materially higher because the price has to cover infrastructure overhead, the support load, the engineering investment that makes the pipeline produce useful refusals instead of fabricated answers, and a margin that lets the business survive.

The headline gap between cost and price is the gross margin. We are not, in this post, publishing the gross margin number — that is a business datum we will share when we are confident in its stability across customer mix. But the cost side of that calculation is the number above. If a customer wants to know what we spend on their compute, the math is reproducible.

Worked example — a 300-question security questionnaire

The pipeline post on security-questionnaire ingest cited a $5 to $15 cost range for a 300-question instrument. Here is the math.

  • 180 questions resolved by dedupe (60% dedupe rate): 180 × ~$0.001 = ~$0.18.
  • 90 questions drafted through the full retrieval pipeline (30% reach drafting): 90 × ~$0.008 in drafting cost = ~$0.72.
  • 30 questions refused or out-of-scope (10% refused): 30 × ~$0.002 = ~$0.06.
  • Verification for the 90 drafted questions: 90 × ~$0.001 = ~$0.09.
  • Storage and emit overhead: ~$0.05.

Total: ~$1.10 for a 300-question questionnaire under the typical mix.

The $5 to $15 range cited in the earlier post includes wider scenarios — first-time customers with no dedupe history (every question goes through full retrieval, costs scale roughly proportionally), questionnaires that include multi-paragraph descriptive answers (longer drafts, more expensive model routing), and the occasional re-run when the initial response gets revised.
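
To see how those wider scenarios open the range up from the ~$1.10 base case, here is a sketch parameterized on dedupe rate and drafting unit cost. The unit costs are the approximations from this post, and the long-form drafting figure is an illustrative guess, not a measured value:

```python
def questionnaire_cost(n: int, dedupe_rate: float,
                       draft_unit: float = 0.008, dedupe_unit: float = 0.001,
                       refuse_rate: float = 0.10, refuse_unit: float = 0.002,
                       verify_unit: float = 0.001, overhead: float = 0.05) -> float:
    """Compute cost for an n-question questionnaire under a given mix."""
    deduped = int(n * dedupe_rate)
    refused = int(n * refuse_rate)
    drafted = n - deduped - refused
    return (deduped * dedupe_unit + drafted * (draft_unit + verify_unit)
            + refused * refuse_unit + overhead)

print(questionnaire_cost(300, dedupe_rate=0.60))  # ~1.10: the typical mix above
print(questionnaire_cost(300, dedupe_rate=0.0))   # ~2.54: first run, no dedupe history
# Long multi-paragraph answers routed to a more capable model push the
# drafting unit cost up several-fold, which is how a first-time,
# long-form run climbs into the $5-$15 range:
print(questionnaire_cost(300, dedupe_rate=0.0, draft_unit=0.03))  # ~8.48
```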

The point is not the specific dollar number. The point is that the compute economics work at the scale of “fractions of a dollar per response,” and that the customer’s price reflects the engineering and support infrastructure around the compute, not the compute itself.

What we do not yet break out cleanly

Ingest cost amortization. When a customer ingests a 500-page PDF into the KB, the extraction and embedding costs are real but spread across all future questions that retrieve from that document. We have an internal model for amortizing ingest cost across retrieval volume, but the model is noisy because retrieval volume per document varies wildly. Per-response cost above does not include ingest amortization.
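
The shape of that internal model, as a sketch. The per-page extraction and embedding figures below are placeholders, and the expected-retrievals denominator is exactly the noisy term described above:

```python
def amortized_ingest_per_retrieval(pages: int,
                                   extract_per_page: float = 0.01,   # placeholder
                                   embed_per_page: float = 0.0005,   # placeholder
                                   expected_retrievals: int = 1_000) -> float:
    """One-time ingest cost spread over a document's expected retrieval volume."""
    ingest_cost = pages * (extract_per_page + embed_per_page)
    return ingest_cost / expected_retrievals

# A 500-page PDF under these guesses: ~$5.25 to ingest, ~$0.005 per retrieval.
print(amortized_ingest_per_retrieval(500))  # 0.00525
```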

Storage at scale. A KB of 200,000 chunks at our current vector storage cost is small. A KB of 5 million chunks (a large enterprise customer with a deep historical corpus) is materially more expensive. We are not yet publishing the storage-cost-vs-corpus-size curve because we do not yet have customers across the full distribution to calibrate it.
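
For a rough sense of why the curve matters, the raw vector bytes alone scale linearly with chunk count. Assuming 1,536-dimension float32 embeddings, a common configuration rather than a statement about our index:

```python
DIMS = 1536          # assumed embedding dimensionality
BYTES_PER_FLOAT = 4  # float32

def index_gigabytes(chunks: int) -> float:
    """Raw dense-vector storage, before metadata, replicas, or the sparse index."""
    return chunks * DIMS * BYTES_PER_FLOAT / 1e9

print(index_gigabytes(200_000))    # ~1.2 GB  -- small
print(index_gigabytes(5_000_000))  # ~30.7 GB -- materially more, before overhead
```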

Reranker cost. A subset of questions trigger a reranker pass over the top-K retrieval candidates. The reranker is a small inference call, fractions of a cent each, and not currently broken out as a line item. We will cover the reranker specifically on Thursday.

Multi-turn flows. When the proposal manager asks the agent a follow-up question against the same response context, the conversation accumulates tokens. A sustained multi-turn session can run several drafts’ worth of token cost. We track this per-account; it is not a dominant share of total cost in normal use.
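
The accumulation is easy to see in a sketch: each turn resends the base context plus all prior turns, so input tokens grow with every exchange. Token counts here are illustrative, with the Sonnet rates from Stage 3:

```python
SONNET_IN, SONNET_OUT = 3.00 / 1e6, 15.00 / 1e6  # $/token, rates at time of writing

def session_cost(turns: int, context_tokens: int = 2_000,
                 turn_tokens: int = 300, output_tokens: int = 250) -> float:
    """Cost of a multi-turn session where each turn resends the full history."""
    cost, history = 0.0, context_tokens
    for _ in range(turns):
        cost += history * SONNET_IN + output_tokens * SONNET_OUT
        history += turn_tokens + output_tokens  # history grows every turn
    return cost

print(session_cost(1))  # ~0.0098 -- about one draft's worth
print(session_cost(5))  # ~0.065  -- several drafts' worth of token cost
```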

Why we are publishing this

Two reasons.

One — we made a public pricing commitment at launch. The cost-per-response math is the most defensible part of that commitment because it is reproducible from public model pricing. We can publish it without publishing customer-specific data, and we should.

Two — the proposal-AI category has historically priced on opaque value rather than on transparent cost. The result is that buyers cannot tell whether the AI features they are paying for are computationally cheap features marked up generously or genuinely expensive features priced fairly. Publishing the cost side of the math lets a buyer reason about the markup. We think the markup is justified by the engineering and the support around the compute, but we want the buyer to be able to make that judgment with the data, not without it.

If the math above does not match what you see in your own account’s billing, write to us. The numbers are reproducible from the logs in the engineering tab. Mismatches are bugs.

The short version

A drafted RFP answer costs us, in compute, about a cent. A 100-question response runs about 50 cents in compute. A 300-question questionnaire runs around a dollar. The price we charge is higher because the compute is the smallest part of producing a usable response: the engineering, the support, and the verification that makes the output trustworthy are the larger part. Both halves of the math should be visible. This post is the first half.

Sources

  1. PursuitAgent — How the Grounded-AI Pledge is enforced in code
  2. Anthropic — Pricing
  3. OpenAI — Pricing