Field notes

How we chunk proposals for retrieval

Fixed-window chunking falls apart at headers, table cells, and numeric clauses. This post walks through the structural-plus-semantic chunking strategy we run on past proposals and KB content blocks, with code.

A proposal is not a blog post. A 1024-token fixed window cut across a proposal does damage that the same window cut across a help-center article does not. The damage is specific and reproducible on common documents, and it drives a measurable drop in precision once you start grading retrieval against held-out questions from real RFPs.

This post is about the chunking strategy we run on past proposals and on the content blocks that make up a customer’s KB. We describe what the naive baseline gets wrong, what we changed, and the small TypeScript snippet that does the actual work.

What naive chunking gets wrong on proposals

Proposals have three structural features that a generic chunker handles badly.

Headers carry meaning the body relies on. A section titled “3.2 Information Security — Encryption at Rest” is followed by a paragraph that begins “We support AES-256 with customer-managed keys.” If the window boundary lands so that the header ends up in chunk N-1 and the body in chunk N, retrieval against the question “do you encrypt data at rest” matches the header chunk and returns the wrong neighbor: the body that actually answers the question sits in a different chunk, with no header context attached.

Tables and matrices break catastrophically. A compliance matrix with 80 rows and a fixed-window chunker is a comedy. The chunk boundary lands mid-row. Half the row is associated with the wrong header. A retrieval call for “RPO requirement” matches the cell label but pulls the wrong RPO value from the row above or below.

Numeric clauses are short and dense. “RTO: 4 hours. RPO: 15 minutes.” is 11 tokens. A 1024-token window swallows this clause inside a paragraph about disaster recovery philosophy, and the philosophy paragraph dominates the embedding. The numeric clause becomes invisible to a question that asks for the number specifically.
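For concreteness, the baseline all three failures fall out of is roughly the sketch below, with whitespace-separated words standing in for tokens. Nothing about the document's structure can move a boundary; only the token budget can.

function fixedWindowChunks(text: string, windowTokens = 1024): string[] {
  // Crude stand-in: treat each whitespace-separated word as one token.
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += windowTokens) {
    // The boundary falls wherever the budget runs out: between a header and
    // its body, mid-table-row, or inside an 11-token numeric clause.
    chunks.push(words.slice(i, i + windowTokens).join(" "));
  }
  return chunks;
}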

These aren’t theoretical. They show up as the “the search is terrible” pattern in Responsive reviews on G2 — a search returns close-but-wrong neighbors, the user can’t find the actual answer, and the AI suggestion built on top inherits the same retrieval miss. Stanford HAI’s legal-RAG study shows the downstream effect: when retrieval surfaces an adjacent-but-not-supporting passage, the generator cites it anyway. Hallucination rates ran 17 to 33% even with retrieval enabled.

The strategy: structural first, semantic second, overlap third

Our chunker runs in three passes.

Pass 1 — Structural split

Before any token counting happens, we walk the document tree and produce candidate chunks at structural boundaries. For markdown and HTML inputs that means heading levels (H1 through H4), list items, table rows, and explicit section breaks. For PDFs run through extraction (LlamaParse in our default path), we use the structural map the extractor produces — pages, sections, table cells.

A chunk produced by structural split carries metadata: the header chain that contains it, the document section ID, and a structural type tag (paragraph, table-row, list-item, heading, clause).

type StructuralChunk = {
  text: string;
  headerChain: string[];          // ["3", "3.2 Information Security", "Encryption at Rest"]
  docId: string;
  blockType: "paragraph" | "table-row" | "list-item" | "heading" | "clause";
  pageRef: { page: number; bbox?: [number, number, number, number] };
};

export function structuralSplit(doc: ParsedDoc): StructuralChunk[] {
  const out: StructuralChunk[] = [];
  // walk visits the parsed document tree; ctx carries the running header
  // chain so each emitted chunk knows which headings contain it.
  walk(doc.root, {
    onHeading: (h, ctx) => ctx.pushHeader(h.text, h.level),
    onParagraph: (p, ctx) =>
      out.push({
        text: p.text,
        headerChain: ctx.currentHeaderChain(),
        docId: doc.id,
        blockType: "paragraph",
        pageRef: p.pageRef,
      }),
    onTableRow: (row, ctx) =>
      out.push({
        text: serializeRow(row),
        headerChain: [...ctx.currentHeaderChain(), row.tableCaption ?? ""],
        docId: doc.id,
        blockType: "table-row",
        pageRef: row.pageRef,
      }),
    // ...list items, clauses, etc.
  });
  return out;
}

Two things matter here. First, the header chain is attached to the chunk, not left in a sibling chunk. When we embed, we prefix the chunk text with a serialized header chain so the embedding sees the section context. Second, table rows are atomic — we never split a row across chunks, even if doing so would balance token counts more evenly.
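
The serialization itself is a one-liner, and the exact delimiter matters less than the chain traveling with the text the model embeds. Roughly:

// Prefix the header chain so the embedding sees section context, not a bare
// paragraph. The " > " delimiter is illustrative; any stable separator works.
function embeddingText(chunk: StructuralChunk): string {
  const context = chunk.headerChain.filter(Boolean).join(" > ");
  return context ? `${context}\n${chunk.text}` : chunk.text;
}

The encryption example above embeds as “3 > 3.2 Information Security > Encryption at Rest” followed by the body, so “do you encrypt data at rest” now matches the chunk that actually contains the answer.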

Pass 2 — Semantic packing

Structural chunks are uneven. A clause might be 11 tokens; a paragraph might be 600. We pack adjacent same-section structural chunks into target-size groups (we currently target 320 tokens, with a hard ceiling at 512), but we never pack across a heading boundary. A clause that lives alone in its section stays alone — we’d rather embed an 11-token chunk and accept the noise than glue it to an unrelated neighbor.

The semantic-packing step also handles a specific case that mattered: when we see a heading followed by exactly one short paragraph, we keep them as a single chunk with the heading inline. Numeric-clause sections (the “RTO/RPO” pattern) almost always look like that.
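
Condensed, the packing rule looks like the sketch below. The tokenizer is passed in rather than assumed, PackedChunk is taken to carry the same metadata as StructuralChunk, and the heading-inline special case is left out for brevity.

const TARGET_TOKENS = 320;
const CEILING_TOKENS = 512;

export function packSections(
  chunks: StructuralChunk[],
  countTokens: (s: string) => number,
): PackedChunk[] {
  const out: PackedChunk[] = [];
  let buf: StructuralChunk[] = [];
  let used = 0;

  // Two chunks belong to the same section iff their header chains match.
  const sectionKey = (c: StructuralChunk) => c.headerChain.join("\u0000");
  const flush = () => {
    if (buf.length === 0) return;
    // Merge the buffered group; metadata (header chain, docId, pageRef)
    // comes from the first chunk in the group.
    out.push({ ...buf[0], text: buf.map((c) => c.text).join("\n") });
    buf = [];
    used = 0;
  };

  for (const c of chunks) {
    const cost = countTokens(c.text);
    const crossesHeading =
      buf.length > 0 && sectionKey(buf[buf.length - 1]) !== sectionKey(c);
    // Hard rule: never pack across a heading boundary. Soft rule: flush once
    // adding this chunk would push the group past the 320-token target.
    if (crossesHeading || (used > 0 && used + cost > TARGET_TOKENS)) flush();
    buf.push(c);
    used += cost;
    if (used >= CEILING_TOKENS) flush(); // hard ceiling at 512
  }
  flush();
  return out;
}

A lone 11-token clause falls out of this naturally: the heading boundary on either side flushes it before anything unrelated can be glued on.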

Pass 3 — Overlap, and only where it helps

Overlap helps when two adjacent chunks share a continuous argument — a paragraph that introduces a concept and the next paragraph that extends it. Overlap hurts when the chunk boundary sits between a header and an unrelated body — overlapping there just smears two unrelated things across both sides.

So we apply overlap conditionally. We compute a similarity score between adjacent semantic-packed chunks (cosine over a fast embedding model) and only carry a 64-token tail from chunk N into chunk N+1 when their similarity clears a floor we tuned on a held-out validation set. Below the floor, the chunks remain hard-bounded.

const OVERLAP_TOKENS = 64;
const SIMILARITY_FLOOR = 0.42; // tuned on a held-out validation set

export async function applyConditionalOverlap(
  chunks: PackedChunk[],
): Promise<PackedChunk[]> {
  const out: PackedChunk[] = [];
  for (let i = 0; i < chunks.length; i++) {
    const cur = chunks[i];
    if (i === 0) {
      out.push(cur);
      continue;
    }
    const prev = chunks[i - 1];
    // Hard boundary: a different document or a different header root
    // never gets overlap.
    if (prev.docId !== cur.docId || !sameHeaderRoot(prev, cur)) {
      out.push(cur);
      continue;
    }
    // Cosine similarity over a fast embedding model, computed per boundary.
    const sim = await fastCosine(prev.text, cur.text);
    if (sim >= SIMILARITY_FLOOR) {
      // Carry a 64-token tail from chunk N into chunk N+1, and record the
      // overlap so retrieval can deduplicate near-identical hits later.
      out.push({
        ...cur,
        text: tail(prev.text, OVERLAP_TOKENS) + " " + cur.text,
        hasOverlap: true,
      });
    } else {
      // Below the floor: keep the boundary hard.
      out.push(cur);
    }
  }
  return out;
}

A small thing that matters: we record whether a chunk has overlap on the chunk itself. At retrieval time, when two chunks score close, we use the overlap flag to deduplicate hits that are functionally the same passage — otherwise an overlapping pair returns twice in the top-k and crowds out a genuinely different result.
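
In code, the dedup is a few lines. A sketch, with the adjacency check left abstract and the score gap illustrative rather than a tuned production value:

type Hit = { chunk: PackedChunk; score: number };

const SCORE_GAP = 0.02; // illustrative, not a tuned production value

// Drop an overlapping neighbor that scores nearly the same as a hit we are
// already keeping: the pair is functionally one passage and should spend one
// top-k slot, not two. Assumes hits arrive sorted by score, best first.
function dedupeOverlapHits(
  hits: Hit[],
  adjacent: (a: PackedChunk, b: PackedChunk) => boolean,
): Hit[] {
  const kept: Hit[] = [];
  for (const hit of hits) {
    const isDup = kept.some(
      (k) =>
        k.chunk.docId === hit.chunk.docId &&
        (k.chunk.hasOverlap || hit.chunk.hasOverlap) &&
        adjacent(k.chunk, hit.chunk) &&
        Math.abs(k.score - hit.score) < SCORE_GAP,
    );
    if (!isDup) kept.push(hit);
  }
  return kept;
}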

What this changed in practice

We grade retrieval against a held-out set of question-passage pairs drawn from real customer KBs. After we shipped the structural-first chunker, two things moved measurably on that set: precision-at-3 on tabular questions (“what’s your RTO,” “list your SOC 2 controls”) improved by about a third, and the false-positive rate on header-only matches dropped by roughly half. Specific numbers will go in a separate evaluation post when the methodology is documented in enough detail to be useful — we’re not going to drop a precision-at-k chart in this post and ask you to take our word for the protocol.

What we still get wrong

Two things, currently.

PDFs whose extraction fails to recover table structure get chunked as if they were prose. Our extraction stack catches most of this, but when it doesn’t, the rest of the chunker has no way to know a table existed. We’re working on a structural-recovery pass that runs after extraction.

Documents in languages other than English have weaker semantic-packing similarity scores under our current fast embedding model. The hard structural boundaries still hold, so retrieval doesn’t fail catastrophically, but the conditional-overlap step under-applies. A multilingual model upgrade is queued.

The short version

Structural splits first. Semantic packing inside the structure. Overlap only where adjacency carries continuous meaning. The header chain travels with the chunk so retrieval matches against context, not bare paragraphs. None of this is fancy, but every step we skipped at the start of the project came back as a measurable retrieval miss on a real customer question.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)
  2. G2 — Responsive (formerly RFPIO) reviews