Field notes

A year of ingest pipeline, condensed

Forty changes to the ingest pipeline across a year of shipping. The five that actually mattered, the ones that didn't, and what the pattern says about where to spend the next year's ingest budget.

Bo Bergstrom · 5 min read

We have shipped roughly 40 changes to the ingest pipeline since the first internal draft in May 2025. Some were rewrites; some were tuning knobs; a handful were the kind of change where we deleted more code than we added. A year in, five of those changes account for nearly all the production wins. Most of the other 35 were necessary, a few were mistakes, and none of them moved the grounded-draft quality number the way the five did.

This is an operator post, not an engineering one. The engineering team has written up the specifics of each piece; I want to talk about the pattern.

The five that mattered

The first: switching from general-purpose PDF parsing to LlamaParse for RFP documents. The difference was not subtle. On a fixed set of 40 representative RFPs we had been grading weekly, extraction fidelity on tables, nested requirement lists, and multi-column layouts went from “usable with manual cleanup” to “usable without.” That made compliance-matrix autogeneration viable as a feature, which is the thing most customers actually wanted from a proposal tool in the first place.
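For readers who want the shape of the change, a minimal sketch assuming the llama-parse Python client; the file name and the result_type choice are illustrative:

```python
from llama_parse import LlamaParse

# Sketch, not our production code: the llama-parse client, asking for
# markdown output because it preserves tables and nested lists better
# than flat text for downstream compliance-matrix extraction.
parser = LlamaParse(
    api_key="llx-...",       # LlamaCloud API key
    result_type="markdown",
)

# Returns a list of Document objects with layout-aware text that keeps
# tables, nested requirement lists, and multi-column regions intact.
documents = parser.load_data("rfp_solicitation.pdf")  # illustrative path
for doc in documents:
    print(doc.text[:500])
```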

The second: semantic chunking keyed to the document’s own section breaks, rather than fixed-length chunking. The previous approach produced chunks that split requirements across boundaries — a “shall” clause in one chunk, the dependent clause in the next. Retrieval would surface half the requirement and the model would confidently draft against the wrong half. Section-aware chunking eliminated that entire class of failure in a single release.
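A minimal sketch of the section-aware split, with a hypothetical heading pattern; real RFPs need a few of these per agency style, and the helper names are illustrative:

```python
import re

# Hypothetical sketch: split on the document's own numbered headings
# (e.g. "3.2.1 Security Requirements") instead of fixed-length offsets,
# so a "shall" clause and its dependent clause stay in one chunk.
HEADING = re.compile(r"^(?=\d+(?:\.\d+)*\s+\S)", re.MULTILINE)

def chunk_by_section(text: str, max_chars: int = 4000) -> list[str]:
    sections = [s.strip() for s in HEADING.split(text) if s.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            # The whole section, requirement clauses intact.
            chunks.append(section)
        else:
            # Only oversized sections get a further, paragraph-level split.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks
```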

The third: content-block versioning in the knowledge base, so that when a source document changes, the blocks derived from it are flagged for review instead of silently going stale. Content freshness is a product feature, not a maintenance chore — we wrote a full post on that in October 2025. The shorter version is: a KB that rots is worse than no KB, because the model will confidently retrieve stale answers and present them as current. Versioning was the change that stopped that failure mode.
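A minimal sketch of the versioning discipline, with illustrative field and function names; the mechanism is just a recorded source hash per derived block:

```python
import hashlib
from dataclasses import dataclass

# Hypothetical sketch: every derived block records the hash of the
# source document it came from. On re-ingest, a changed hash flags the
# block for human review instead of letting it go silently stale.

@dataclass
class ContentBlock:
    block_id: str
    text: str
    source_doc_id: str
    source_hash: str            # hash of the source at derivation time
    needs_review: bool = False

def source_digest(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()

def flag_stale_blocks(blocks: list[ContentBlock], doc_id: str, raw: bytes) -> int:
    """Mark blocks derived from doc_id whose source has changed."""
    current = source_digest(raw)
    flagged = 0
    for block in blocks:
        if block.source_doc_id == doc_id and block.source_hash != current:
            block.needs_review = True
            flagged += 1
    return flagged
```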

The fourth: diagram extraction as a first-class ingest step, not an afterthought. RFPs in regulated industries contain network diagrams, process flows, org charts. For most of the first year we extracted their text captions and ignored the diagram content itself. Once we built the extraction path — caption plus D2 code plus a text description — the grounded-draft quality on sections that referenced those diagrams jumped measurably. The change shipped in Q1 2026 and the uplift was bigger than I expected.
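A sketch of what “first-class” means here, with illustrative field names; the point is that the diagram becomes a retrievable, citable block rather than a lost image:

```python
from dataclasses import dataclass

# Hypothetical sketch of the diagram artifact as an ingest output:
# caption, D2 source, and a prose description, all retrievable.

@dataclass
class DiagramBlock:
    caption: str         # the extracted figure caption
    d2_source: str       # D2 diagram code reconstructed from the image
    description: str     # text description the retriever can match on
    source_doc_id: str
    page: int

    def as_retrieval_text(self) -> str:
        # Concatenated so queries about the diagram's content, not just
        # its caption, can surface the block.
        return f"{self.caption}\n{self.description}\n{self.d2_source}"
```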

The fifth: per-claim evidence linking at draft time. Not a post-hoc audit; a requirement that the drafting loop refuse to produce a sentence without a source. This is the change that most closely ties to our grounded-AI pledge, and the one that most changed how customers talk about the product in renewal conversations.
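A minimal sketch of the invariant, with illustrative types; in practice the loop regenerates the sentence with a citation instruction rather than surfacing the error:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a sentence without at least one evidence link is
# rejected at draft time, not flagged in a post-hoc audit.

@dataclass
class DraftSentence:
    text: str
    evidence_ids: list[str] = field(default_factory=list)

class UngroundedSentenceError(Exception):
    pass

def accept(sentence: DraftSentence) -> DraftSentence:
    """Gate every sentence the drafting loop emits."""
    if not sentence.evidence_ids:
        # The loop catches this and retries generation with an explicit
        # citation instruction; unsourced text never enters the draft.
        raise UngroundedSentenceError(sentence.text)
    return sentence
```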

The 35 that didn’t

The 35 changes that didn’t move the number were mostly micro-tunings. Retrieval hyperparameters. Embedding model swaps. Chunk-size ablations we ran for weeks. We learned from all of them — the chunk-size ablation post is still one of our most-linked engineering pieces — but the lesson was always the same: the variance between “well-tuned system” and “aggressively tuned system” on these axes was small. The variance between “right architecture” and “wrong architecture” was large.

The Stanford legal-RAG paper documents the same pattern in a different domain: retrieval tuning alone does not close the hallucination gap. The architecture decisions — what you retrieve, how you chunk, how you force the generator to cite — are where the hallucination rate actually moves. Tuning gets you maybe a relative 10% on top; architecture gets you the 2x.

Two mistakes worth naming

One: we over-invested in a generic embedding model for the first year, thinking we could retune later. We could retune later, but the retuning cost us two sprints and a noticeable regression in recall that we had to chase for a month. If we were starting today, we would pick a domain-adapted embedding model from day one.

Two: we shipped a version of the ingest pipeline in month eight that assumed a single-document-per-RFP model. About 15% of federal RFPs arrive as zip files with 12 to 40 attachments, and the initial pipeline treated the primary solicitation as “the” document. We noticed the gap three months later when compliance matrices started missing requirements that lived in appendices. The fix was a month of work that would have been two days if we had started with a multi-document mental model.
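A minimal sketch of the multi-document intake model we should have started with; the paths and the return shape are illustrative:

```python
import zipfile
from pathlib import Path

# Hypothetical sketch: a federal RFP is a set of documents under one
# RFP id, never a single file. The primary solicitation and every
# attachment, appendices included, flow through the same ingest path.

def expand_rfp(upload: Path, rfp_id: str) -> list[tuple[str, Path]]:
    """Return (rfp_id, path) pairs for every document in the intake."""
    if upload.suffix.lower() != ".zip":
        return [(rfp_id, upload)]
    out_dir = upload.parent / f"{rfp_id}_docs"
    with zipfile.ZipFile(upload) as zf:
        zf.extractall(out_dir)
    return [(rfp_id, p) for p in sorted(out_dir.rglob("*")) if p.is_file()]
```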

What the pattern says

Ingest is a place where structural decisions dominate tuning decisions. We will keep tuning, because the tuning is cheap and compounds over time. But the five changes that mattered were structural: a different parser, a different chunk boundary, a different staleness discipline, a different artifact class, a different drafting invariant. Three of those five were decisions to add a whole new class of thing to the pipeline rather than improve an existing class.

The Sparrow team’s piece on content-library practices names the same observation from the KB side: libraries fail not because the retrieval is bad but because there is nothing structurally in place to keep the content credible. Our ingest pipeline failed, when it failed, for the same reason — missing structure, not missing tuning.

What year two looks like

We have five structural candidates on the board for year two. Two are multi-document — smarter handling of amendments and addenda that land after intake, and a formal answer-provenance graph that ties every ingested block to its audit trail. Two are modality — better handling of Excel-based compliance rubrics and a first pass at tabular data as citable blocks rather than text approximations. One is speed — a streaming ingest path so large RFPs don’t stall a team for 40 minutes on upload.

Not all five will land. The one that probably matters most is the multi-document amendment handling, because that is where customers keep hitting edges. We will ship, we will measure, and in April 2027 I will write this post again with a different list of five.

A year in, the ingest pipeline is still the part of the product that has the most leverage per engineering hour. The job is not done. The structure is the thing to get right, and the tuning is the thing to keep doing forever.

Sources

  1. Stanford HAI — Legal RAG hallucinations
  2. Sparrow — RFP content library best practices