The chunk size ablation: 256, 512, 1024 tokens on RFP text
We ran the same retrieval pipeline at three chunk sizes against our RFP-text gold set. Directional results, the tradeoffs that surfaced, and why we don't ship a single global chunk size.
Chunk size is one of those retrieval parameters that feels small and turns out to govern most of what the system can do. Too small, and a chunk doesn't carry enough context to be retrievable on its own. Too large, and the chunk dilutes: the embedding averages over too much, and the retriever can't pick out the specific paragraph that answers a specific question. There is no universally correct answer. There is a corpus- and query-dependent best, and finding it requires running the same pipeline at multiple chunk sizes and comparing.
This post is the writeup of one such comparison. We ran the retrieval pipeline at 256, 512, and 1024 tokens against our RFP-text gold set. The results below are directional: we are not going to publish precise percentages, because the numbers are sensitive to the specific gold set, the specific embedder, and the specific reranker. The qualitative findings, however, are stable across re-runs and matter for how we ship.
A note on the numbers. Everything below comes from our internal eval harness against our own corpora. This is not a general benchmark and we would not expect the specific numbers to reproduce on an external setup. The shape of the tradeoff — smaller chunks help precision, larger chunks help claim-coverage, the crossover sits near a middle value — is the durable finding. We will publish the harness and a public gold set when both are stable.
The setup
Same pipeline as the retrieval eval pillar. Same gold set. Same embedder. Same reranker. The only thing that varied across runs was the chunk size used at index-build time, with chunk overlap held at roughly 10% of chunk size in tokens.
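For concreteness, here is a minimal sketch of fixed-size chunking with roughly 10% token overlap. The tokenizer interface and the function name are illustrative assumptions, not our production indexing code.

```python
# Minimal sketch of fixed-size chunking with ~10% token overlap.
# `tokens` is the document already tokenized by whatever tokenizer the
# embedder uses; names here are illustrative, not our production pipeline.

def chunk_tokens(tokens: list[str], chunk_size: int) -> list[list[str]]:
    overlap = max(1, chunk_size // 10)   # ~10% of chunk size, in tokens
    stride = chunk_size - overlap        # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(window)
        if start + chunk_size >= len(tokens):
            break                        # last window already reached the end
    return chunks
```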
The corpus for this ablation is our RFP-text subset — a slice of the gold set focused on prior winning RFP responses, security questionnaire archives, and approved past-performance write-ups. About 11,000 chunks at the 512-token baseline (the others scale roughly inversely with chunk size: ~22,000 at 256, ~5,500 at 1024).
The gold-set slice contains 412 question-style queries: the kinds of things an SME or a writer would type into the search bar when drafting an answer. We deliberately excluded identifier-shaped queries from this slice because we already know the BM25 fallback dominates that path; it is a separate ablation.
What we measured
Three things, in order of how much they varied across chunk sizes.
Precision@5. Headline metric. How often the top 5 retrieved chunks contained at least one labeled-relevant block.
Recall@20. Ceiling check. Whether the right material made it into the top 20 at all.
Claim-coverage rate. After the drafting engine produced an answer from the retrieved chunks, what fraction of substantive claims in the answer traced to a span in a gold-labeled block.
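As a rough sketch, the first two metrics reduce to per-query indicators averaged over the query set; claim-coverage requires the drafting engine and verifier downstream, so it is not sketched here. The data shapes below are assumptions for illustration, not our harness's actual interfaces.

```python
# Per-query hit indicators, averaged over the query set, matching the
# definitions above. `ranked` is the retriever's ranked list of chunk ids;
# `gold` is the set of labeled-relevant chunk ids for that query.

def hit_at_k(ranked: list[str], gold: set[str], k: int) -> bool:
    return any(chunk_id in gold for chunk_id in ranked[:k])

def evaluate(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    n = len(runs)
    return {
        "precision@5": sum(hit_at_k(ranked, gold, 5) for ranked, gold in runs) / n,
        "recall@20": sum(hit_at_k(ranked, gold, 20) for ranked, gold in runs) / n,
    }
```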
We also tracked latency and embedding cost, but those are well known to scale with chunk count rather than chunk size, so the tradeoff there is mechanical and not interesting to write about.
Directional findings
Numbers below are rounded and described qualitatively. Specific values are sensitive to gold-set version and embedder.
Precision@5
The 512-token configuration was the best on this metric, by a small but consistent margin. The 256 configuration was meaningfully worse. The 1024 configuration was close to 512 but trailing.
The 256 underperformance has a clean explanation. At 256 tokens, a chunk often does not contain enough context for the embedder to disambiguate it from neighboring chunks on a similar topic. Two adjacent paragraphs about, say, incident response — one describing the SLA, one describing the escalation matrix — embed to nearly the same vector. The retriever cannot pick the right one for an SLA-specific query because they look the same. The reranker partially corrects this; the correction is not full.
The 1024 underperformance has a different explanation. At 1024 tokens, a chunk often contains multiple distinct ideas. The embedding averages across them. A query that targets one of those ideas retrieves the chunk, but with a weakened relevance score, because the chunk's vector reflects the other ideas too. Precision drops at the top of the ranking, even though the right chunk is technically present.
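A toy way to see the dilution effect, under the simplifying assumption that a large chunk's embedding behaves roughly like an average of its constituent ideas (real embedders are not literally mean-pooled over ideas, but the intuition holds):

```python
# Toy illustration of dilution at large chunk sizes; vectors are made up.
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query       = np.array([1.0, 0.0, 0.0])   # targets idea A
idea_a      = np.array([0.9, 0.1, 0.0])   # the paragraph that answers the query
idea_b      = np.array([0.0, 0.2, 0.9])   # an unrelated idea in the same chunk

small_chunk = idea_a                       # small chunk: one idea
large_chunk = (idea_a + idea_b) / 2        # large chunk: averaged ideas

print(cos(query, small_chunk))             # high score: ranks near the top
print(cos(query, large_chunk))             # lower score: diluted by the second idea
```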
Recall@20
Recall@20 was less sensitive to chunk size than precision@5. All three configurations cleared the recall ceiling we set as our quality bar, with 512 and 1024 essentially tied and 256 very slightly behind.
This pattern is intuitive once you stare at it. Recall@20 measures whether the right material is in the top 20 anywhere — whether the retriever cast a wide enough net. Larger chunks, by virtue of containing more text, are more likely to contain at least some content relevant to a given query. They make the top 20 list easily. The cost, paid in precision, is that they don’t make the top 5 in the right order.
Claim-coverage
This was the most interesting result. Claim-coverage was best at 1024 tokens. Worst at 256.
The reason is that the drafting engine works better when it has more context per chunk. A claim that requires composition across two adjacent ideas — “we hold SOC 2 Type II and it is audited annually” — can be drafted in one sentence from a 1024-token chunk that contains both ideas. The same claim, drafted from a 256-token chunk, is more likely to result in either an incomplete sentence (only one of the two facts grounded) or a refusal (the verifier rejects the second clause as ungrounded).
The tradeoff: 256 wins more retrievals exactly because the chunk is small and topic-specific, but loses more drafts because the chunk is too small to compose a complete claim from. 1024 loses some retrievals to dilution but produces more complete drafts when the retrieval lands.
512 sits in the middle and is the best single global default. It is not the best on any single metric, but it is the best choice when you have to pick one.
What we shipped
We did not ship a single chunk size as a global default. We shipped a chunk size that varies by document type.
- RFP responses and prior proposals: 512 tokens.
- Security attestations and compliance documents: 1024 tokens. Compliance prose is dense, claims are compound, and the drafting engine benefits from larger chunks.
- Product one-pagers and marketing collateral: 256 tokens. This content is structurally short — a single feature, a single benefit — and 256 tokens reflects the natural unit of meaning.
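In the indexing pipeline this amounts to a small lookup at index-build time. The mapping below is a hypothetical sketch: the values match the configuration above, but the document-type names and structure are illustrative.

```python
# Illustrative per-document-type chunk sizes; names are hypothetical.
CHUNK_SIZE_BY_DOC_TYPE: dict[str, int] = {
    "rfp_response": 512,
    "prior_proposal": 512,
    "security_attestation": 1024,
    "compliance_document": 1024,
    "product_one_pager": 256,
    "marketing_collateral": 256,
}
DEFAULT_CHUNK_SIZE = 512  # fallback for untyped documents

def chunk_size_for(doc_type: str) -> int:
    return CHUNK_SIZE_BY_DOC_TYPE.get(doc_type, DEFAULT_CHUNK_SIZE)
```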
The decision to vary chunk size by document type was driven by per-corpus eval runs showing that the typed configuration, compared with a single global default, moved precision@5 by a couple of points and claim-coverage by a similar margin in the right direction. Not enormous, but real, repeatable, and worth the additional indexing-pipeline complexity.
What we didn’t try in this ablation
A few honest gaps.
Hierarchical / parent-document retrieval. A pattern where small chunks are used for retrieval and a larger surrounding context is provided to the drafting engine at generation time. This gets the precision benefit of small chunks and the claim-coverage benefit of large chunks, at the cost of a more elaborate retrieval pipeline. We have a research branch on it; it is not in production yet. When it ships, this ablation will get rerun.
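For readers unfamiliar with the pattern, a minimal sketch of parent-document retrieval looks something like the following; `search` and `parent_of` are hypothetical interfaces standing in for whatever index and document store you run, not our research branch.

```python
# Sketch of parent-document retrieval: rank over small chunks for precision,
# then hand the drafting engine the larger parent span each hit came from.

def retrieve_with_parent_context(query: str, search, parent_of, k: int = 5) -> list[str]:
    hits = search(query, k=k)              # ranked small-chunk ids
    parents, seen = [], set()
    for chunk_id in hits:
        parent_id = parent_of(chunk_id)    # e.g. the enclosing section or page
        if parent_id not in seen:          # dedupe: sibling chunks share a parent
            seen.add(parent_id)
            parents.append(parent_id)
    return parents                         # larger spans passed to the drafting engine
```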
Sentence-level chunking. Chunks at the natural sentence boundary, sometimes 50 tokens or fewer. The literature is mixed on whether this works better than fixed-size chunking; on our corpus, an early test suggested it did not, and we did not pursue it. It may be worth revisiting.
Embedding-model variation. All three runs used the same embedder. The right chunk size depends on the embedder’s effective context window and on how the embedder responds to dense vs. sparse content. A different embedder might shift the optimal size. We will rerun this ablation on any embedder swap.
Reranker-off baseline. The runs above all included the cross-encoder reranker. The chunk-size effect is presumably stronger without the reranker, since the reranker partially fixes the dilution problem at 1024 and the disambiguation problem at 256. We did not run the reranker-off baseline for this writeup. It is on the to-do list.
What this changes for builders
Three things worth pulling out for anyone tuning a similar system.
The right chunk size is corpus-dependent. A single global value will be wrong for some part of your corpus. If your eval setup can support varying chunk size by document type, the gains are real. If it can’t, 512 is a defensible global default for proposal-shaped text.
Precision and claim-coverage move in opposite directions with chunk size. A team that optimizes only on precision@5 will tend to favor smaller chunks; a team that optimizes only on claim-coverage will favor larger ones. The right composite metric weights both, and the balance is not the same for every customer’s corpus.
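A composite can be as simple as a weighted combination; the sketch below is illustrative, and the weight itself is the corpus-dependent part that has to come from your own eval runs.

```python
# Illustrative composite metric; the weight is a per-corpus tuning choice.
def composite_score(precision_at_5: float, claim_coverage: float, w_precision: float = 0.5) -> float:
    return w_precision * precision_at_5 + (1.0 - w_precision) * claim_coverage
```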
Hierarchical retrieval is the next move. If you are running a single-stage retrieval pipeline today and considering chunk-size tuning, the bigger win is probably to invest in hierarchical retrieval rather than in finding a slightly better global chunk size. The literature and our own experiments both suggest the gains are larger.
We will follow up with the hierarchical-retrieval ablation when that’s ready. The companion post on the methodology behind these numbers is How we evaluate retrieval quality on our own corpus, and the longer chunking-pipeline writeup is How we chunk proposals.
This post is by the PursuitAgent engineering team. Engineering posts are a shared byline rather than a single author; views reflect PursuitAgent’s position and are written by the engineers building the product.