Blog · Tag
ingest.
6 posts in this archive.
A year of ingest pipeline, condensed
Forty changes to the ingest pipeline across a year of shipping. The five that actually mattered, the ones that didn't, and what the pattern says about where to spend the next year's ingest budget.
Shipped: bulk RFP ingest with duplicate detection
A short changelog entry. Bulk ingest of 10 RFPs in a minute, with block-level duplicate detection so the same clauses across multiple RFPs don't double-count in your KB.
Turning a SOC 2 PDF into 140 KB blocks
The ingest, the extraction, the linking. A worked trace of how a SOC 2 Type II report becomes the set of KB blocks that DDQ answers cite, with the real pgvector row shape at the end.
Semantic deduplication of KB blocks at ingest
How we merge near-duplicate KB blocks at ingest time using embedding similarity, the threshold we settled on after testing four values, and the trade-off we accept by tuning toward over-merging.
Inside the ingest pipeline: parse, extract, index
How a PDF becomes searchable KB blocks. LlamaParse for parsing, structural-plus-semantic extraction, pgvector indexing with HNSW. Where each stage wins and where it falls over.
Shipped: multi-doc RFP ingest with attachment dependencies
RFPs ship as bundles. The scoring rubric, the technical appendix, the pricing workbook. The Analyzer now ingests all of them as one pursuit, with dependencies tracked between them.