Our chunking pipeline, end to end
Five stages between an uploaded PDF and a retrievable KB block: parse, structural split, semantic rechunk, overlap, and index. Where each one fails and why we kept the boundaries.
Chunking is the unglamorous step that decides whether retrieval works. Get it wrong and your retrieval scores look fine on benchmarks while your reviewers complain that the cited paragraph doesn’t say what the draft claims it says. Get it right and the rest of the stack — embeddings, retrieval, generation — has a fair chance.
This post walks through our chunking pipeline as it runs in production today. Five stages: parse, structural split, semantic rechunk, overlap, and index. Each one is a place where we have made a deliberate choice that we have re-litigated at least once.
Stage 1 — Parse
An uploaded document arrives as a PDF, a Word document, or a Google Doc export. Each of those goes through a different parser. PDFs go through LlamaParse by default, with Adobe PDF Extract as a fallback for structured forms. Word documents go through a python-docx-derived parser. Google Docs go through the Drive API.
The output of parse is not a string. It is a structured tree of blocks: headings, paragraphs, list items, table cells, figure captions, and (where extractable) image regions with surrounding context. Page numbers and bounding boxes are preserved on every block. This is the load-bearing data structure for the rest of the pipeline. If a block doesn’t carry its page reference forward, citations break.
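For concreteness, here is roughly the shape of a parsed block, written as a Python dataclass for this post. The field names are illustrative, not our production types.

# A sketch of the parse output, not the production types.
from dataclasses import dataclass, field

@dataclass
class ParsedBlock:
    block_type: str    # heading | paragraph | list_item | table_cell | figure_caption | image_region
    text: str          # for image regions, the extracted surrounding context
    page_start: int    # page references travel with every block...
    page_end: int      # ...because citations depend on them
    bbox: tuple | None = None  # (x0, y0, x1, y1) bounding box, when the parser provides one
    children: list["ParsedBlock"] = field(default_factory=list)  # the tree structure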
Where this fails: scanned PDFs where the OCR quality is marginal. We get blocks, but the text inside them is noisy. We surface a confidence score on the parse and let the operator review. We do not pretend to retrieve well from a document we couldn’t read well.
Stage 2 — Structural split
The structural split takes the parse tree and produces a list of atomic blocks — the smallest unit that retains semantic coherence on its own. A heading on its own is not an atomic block. A heading plus the first paragraph under it usually is.
The rules are conservative:
- Headings get attached to the next paragraph, not the previous one.
- List items inside a <ul> or <ol> are kept together unless the list runs longer than 200 tokens, in which case the list splits at natural sub-group boundaries.
- Tables stay whole when they fit under 400 tokens. Above that, they split row-wise with the header row replicated to each chunk (sketched after this list).
- Figure captions stay attached to their figure block.
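The table rule is the fiddliest, so here is a minimal sketch of the row-wise split, assuming a count_tokens() helper (not shown) and a table represented as a header row plus data rows. The production splitter works on the parse tree, not bare lists of strings.

def split_table(header, rows, count_tokens, max_tokens=400):
    """Row-wise table split with the header row replicated into each chunk.
    Tables under max_tokens come back as a single chunk, i.e. whole."""
    chunks, current = [], [header]
    used = count_tokens(header)
    for row in rows:
        cost = count_tokens(row)
        # Flush before crossing the ceiling, but never emit a header-only
        # chunk: an oversized single row still lands in its own chunk.
        if current != [header] and used + cost > max_tokens:
            chunks.append(current)
            current, used = [header], count_tokens(header)
        current.append(row)
        used += cost
    chunks.append(current)
    return chunks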
-- Simplified shape of the kb_blocks table after structural split
CREATE TABLE kb_blocks (
id UUID PRIMARY KEY,
document_id UUID NOT NULL,
document_version INT NOT NULL,
block_type TEXT NOT NULL, -- heading_para | list | table | figure | diagram
page_start INT NOT NULL,
page_end INT NOT NULL,
parent_id UUID, -- for nested structure (e.g., section -> subsection)
text TEXT NOT NULL,
token_count INT NOT NULL,
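-- shadow and embedding columns (added in stages 4 and 5) are omitted here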
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
The structural split is the part of the pipeline that respects the document author’s intent. If they put a paragraph break, we honor it. We do not slide a 512-token window across the document; sliding windows turn one paragraph into three blocks that are mostly the same content with shifted boundaries, and that wrecks retrieval precision later.
Stage 3 — Semantic rechunk
Atomic blocks are good for fidelity. They are sometimes bad for retrieval, because the smallest unit isn’t always the right unit to embed. A four-sentence paragraph that defines a term, gives an example, qualifies the example, and concludes is one semantic unit; splitting it into four blocks because the structural parser saw four sentences would fragment the same idea across four vectors.
The semantic rechunk merges atomic blocks when three conditions hold (a sketch of the merge loop follows the list):
- They share the same parent (same heading, same list, same table).
- Their concatenated token count fits inside the embedding window with margin (we target 350 tokens with a 512 ceiling).
- A semantic similarity check on adjacent atomic blocks clears a threshold — we use a small embedding model on the candidate pair and a 0.74 cosine floor.
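A minimal sketch of the merge loop, assuming row objects with the kb_blocks columns above and embed()/cosine() helpers that are not shown. The numbers are the ones from the list.

TARGET_TOKENS = 350   # what we aim for per merged block
CEILING_TOKENS = 512  # the embedding window's hard limit
SIM_FLOOR = 0.74      # cosine floor on adjacent pairs

def semantic_rechunk(blocks):
    """Merge adjacent atomic blocks that clear all three conditions."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.parent_id == block.parent_id  # 1. same parent (also blocks merging across headings)
                and prev.token_count < TARGET_TOKENS   # 2. grow toward the target...
                and prev.token_count + block.token_count <= CEILING_TOKENS  # ...never past the ceiling
                and cosine(embed(prev.text), embed(block.text)) >= SIM_FLOOR):  # 3. semantic check
            prev.text += "\n" + block.text
            prev.token_count += block.token_count
            prev.page_end = block.page_end
        else:
            merged.append(block)
    return merged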
The rechunk does not merge across heading boundaries. A subsection on “Authentication” and a subsection on “Authorization” stay separate even when both are short, because their retrieval contracts are different. A query about authentication should not bring back the authorization block as a near-tie.
Where this fails: dense documents where the structural parser produced one giant block per page. The semantic rechunk doesn’t split — it only merges. For those documents we have a fallback splitter that uses sentence-boundary detection plus the same 0.74 similarity threshold to find natural splits. It runs only when the structural parser produces blocks above 800 tokens; for normal documents it never fires.
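A sketch of the fallback splitter's shape, assuming split_sentences(), embed(), and cosine() helpers that are not shown:

OVERSIZED_TOKENS = 800  # the fallback never fires below this

def fallback_split(text, token_count):
    """Split an oversized block at sentence boundaries where adjacent
    sentences fall below the 0.74 similarity floor."""
    if token_count <= OVERSIZED_TOKENS:
        return [text]  # normal documents: pass through untouched
    sentences = split_sentences(text)
    pieces, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < 0.74:  # topic shift: a natural split point
            pieces.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    pieces.append(" ".join(current))
    return pieces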
Stage 4 — Overlap
Overlap is where we hedge. Even with careful boundaries, retrieval will sometimes hit the boundary itself: a query that semantically belongs to “the end of block A and the start of block B” is a known retrieval failure mode. Sliding-window approaches handle this clumsily by duplicating content. We do something narrower.
For each block, we compute a 60-token prefix shadow (the last 60 tokens of the previous sibling block) and a 60-token suffix shadow (the first 60 tokens of the next sibling block). Shadows are stored as separate columns. They are not embedded with the main block. They are searched against only when the main block’s score is above the retrieval floor but below the confidence-gate threshold — in other words, when the block is a borderline match.
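A sketch of the shadow computation over sibling order, assuming first_tokens()/last_tokens() helpers that slice by tokens rather than characters:

SHADOW_TOKENS = 60

def attach_shadows(siblings):
    """Take each block's prefix shadow from its previous sibling and its
    suffix shadow from its next sibling. Shadows live in their own
    columns and are never embedded with the main block."""
    for i, block in enumerate(siblings):
        block.prefix_shadow = last_tokens(siblings[i - 1].text, SHADOW_TOKENS) if i > 0 else None
        block.suffix_shadow = first_tokens(siblings[i + 1].text, SHADOW_TOKENS) if i + 1 < len(siblings) else None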
The shadow lookup is a cheap second pass. If a borderline block’s shadow contains a strong match for the query, the retrieval surfaces both blocks (the main block and its neighbor) as a compound candidate. The drafting engine then has to decide whether the answer crosses the boundary. Most of the time it doesn’t. When it does, we have a research-branch path for multi-block entailment that we wrote about in the Pledge-in-code post.
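The gate logic, sketched with hypothetical names: best_shadow_hit() stands in for the shadow search, and the floor/gate values are retrieval-side configuration we are not reproducing here.

def retrieval_candidates(block, score, query, floor, gate):
    """Shadows are consulted only in the borderline band: floor <= score < gate."""
    if score >= gate:
        return [block]            # confident match: no second pass needed
    if score < floor:
        return []                 # below the floor: not a candidate at all
    # Hypothetical helper: the best shadow score for the query, and the
    # neighbor block that shadow came from.
    shadow_score, neighbor = best_shadow_hit(block, query)
    if shadow_score >= gate:
        return [block, neighbor]  # compound candidate: the answer may cross the boundary
    return [block]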
Stage 5 — Index
The final stage is the embedding and index write. We embed each block once with the production embedding model (text-embedding-3-large at this writing) and store the vector in a pgvector column with an HNSW index. We re-embed when the document is re-parsed or when we change the embedding model — never on edit, because edits invalidate citations.
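A sketch of the write path, assuming the OpenAI Python client and the pgvector adapter for psycopg; batching, retries, and the re-embed triggers are omitted.

import numpy as np
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()

def index_block(conn, block):
    """Embed a block once and write the vector to its pgvector column."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=block.text)
    vector = np.array(resp.data[0].embedding)
    register_vector(conn)  # normally registered once per connection, not per write
    conn.execute(
        "UPDATE kb_blocks SET embedding = %s WHERE id = %s",
        (vector, block.id),
    )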
Each block carries a content hash. The hash is a normalized SHA-256 of the block text after whitespace collapse. When a document is re-parsed and a block hash matches an existing block in the same document, we retain the block’s ID and citation history. When a hash differs, we mint a new block ID and increment the document version. This is what lets a citation in a 2024 proposal survive a 2025 re-import of the source document, as long as the cited paragraph hasn’t been edited.
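The hash itself is short enough to show. A sketch, assuming "whitespace collapse" means collapsing every whitespace run to a single space:

import hashlib

def block_hash(text: str) -> str:
    """Normalized SHA-256 of the block text after whitespace collapse."""
    normalized = " ".join(text.split())  # collapse all whitespace runs to single spaces
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()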
CREATE INDEX kb_blocks_embedding_hnsw_idx
ON kb_blocks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
The index parameters are tuned conservatively. We get slightly worse query latency than a more aggressive HNSW configuration would give, in exchange for higher recall on the long tail of queries that don’t hit the obvious top-1 block. For proposal retrieval, recall matters more than latency at the boundary; reviewers will wait 200 ms longer for a better citation, every time.
Why we kept the boundaries
We’ve considered collapsing some of these stages. The most tempting collapse is structural-split + semantic-rechunk into a single LLM-driven step that takes the parse tree and emits a list of “good” blocks. Models are smart enough to do this. The problem is that the step becomes non-deterministic, expensive, and hard to debug. When a customer says “this block split weirdly,” we want to be able to point at a deterministic rule. An LLM-driven splitter doesn’t give us that. So the pipeline stays five stages.
We’ve also considered dropping the overlap stage. The shadow approach adds storage and a second-pass lookup. We did drop it once, ran a held-out RFP eval, and watched citation-fidelity rate drop by enough percentage points that we put it back the next sprint.
What’s still rough
Two known issues.
Diagrams and tables don’t compose. A diagram extracted from a system architecture document and a table on the same page describing the components are conceptually one unit. Retrieval treats them as two. We’re working on this — the diagram extraction work shipped this week brings diagrams into the same block table, but composing them with neighboring tables is open.
Cross-document blocks. Two documents that describe the same product feature in slightly different language produce two separate blocks. Retrieval surfaces both. The drafting engine doesn’t yet know that they are restatements of the same fact. A reviewer often does. We log these cases and have a research notebook open on cross-document deduplication.
A correct first cite, every time, is the bar. The pipeline above is what gets us close on standard documents. The two open issues above are what stops us from getting there on the harder ones.