Field notes

Shipped: bulk RFP ingest with duplicate detection

A short changelog entry. Bulk ingest of up to 10 RFPs in about a minute, with block-level duplicate detection so the same clauses across multiple RFPs don't double-count in your KB.

PursuitAgent · Engineering · 3 min read

Bulk ingest went live yesterday. You can now upload up to 10 RFPs at once and have them parsed, chunked, and indexed in about a minute — assuming each is under 80 pages. Larger documents queue serially behind the batch. This is a refinement of the ingest path described on the RFP Analysis and Knowledge Base platform pages; the duplicate detection in particular is the same dedupe surface the KB page promises.

What’s in the release

Parallel extraction. Each RFP runs through its own extraction worker. LlamaParse is the default; Adobe PDF Extract is selectable per-company for shops with an Adobe license. No change to the extraction code; the change is in the queue configuration, which now runs 10 concurrent workers instead of one.
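The shape of the change looks roughly like this. A minimal sketch, not the actual queue code; extract_document, run_batch, and the worker-cap constant are illustrative stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_WORKERS = 10  # was 1 before this release

def extract_document(path: str) -> dict:
    """Placeholder for the per-RFP extraction step (LlamaParse by default)."""
    return {"path": path, "blocks": []}

def run_batch(paths: list[str]) -> list[dict]:
    # Each RFP gets its own extraction worker; up to 10 run at once.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_WORKERS) as pool:
        return list(pool.map(extract_document, paths))
```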

Block-level duplicate detection. This is the part that mattered most. Federal RFPs reuse clauses across agencies — FAR flow-down language, standard contract terms, boilerplate evaluation criteria. When you ingest five federal RFPs in a row, the same clause appears in all five. Without dedupe, your KB ends up with five copies of the FAR 52.219-14 text and your retrieval starts returning the same content five times.

The dedupe runs at the block level post-chunking. A block is a semantic unit — a requirement, a clause, a paragraph. We hash the normalized text (whitespace-collapsed, lowercased, stopwords preserved) and look for exact and near-duplicate matches. Exact matches get consolidated to a single block with a multi-source reference; near-duplicates (>90% similarity) get flagged for review but kept as separate blocks.
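A rough sketch of that logic, under some assumptions: SequenceMatcher from the standard library stands in for whatever near-duplicate measure the pipeline actually uses, and the function names are illustrative:

```python
import hashlib
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase; stopwords are deliberately preserved.
    return re.sub(r"\s+", " ", text).strip().lower()

def block_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def classify(new_block: str, existing: dict[str, str]) -> str:
    """Return 'exact', 'near', or 'unique' against previously indexed blocks."""
    h = block_hash(new_block)
    if h in existing:
        return "exact"                 # consolidate into the existing block
    norm = normalize(new_block)
    verdict = "unique"
    for other in existing.values():
        if SequenceMatcher(None, norm, other).ratio() > 0.90:
            verdict = "near"           # keep as a separate block, flag for review
            break
    existing[h] = norm                 # near-dups and uniques both stay in the index
    return verdict
```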

Per-document attribution preserved. A consolidated block remembers every source RFP it appeared in. When that block is retrieved in a later response, you see the citation list — “this clause appears in 5 ingested RFPs, including [names and dates].” Retrieval treats it as a single block; the provenance stays fully traceable.
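In data-structure terms, a consolidated block is one piece of text plus a list of source references. A hypothetical sketch — the real schema isn't shown in this post, and SourceRef and ConsolidatedBlock are illustrative names:

```python
from dataclasses import dataclass, field

@dataclass
class SourceRef:
    rfp_name: str
    ingested_at: str   # e.g. ISO date of the ingest

@dataclass
class ConsolidatedBlock:
    text: str
    sources: list[SourceRef] = field(default_factory=list)

    def add_source(self, ref: SourceRef) -> None:
        # Consolidation keeps one copy of the text but records every source RFP.
        self.sources.append(ref)

    def citation(self) -> str:
        names = ", ".join(s.rfp_name for s in self.sources)
        return f"This clause appears in {len(self.sources)} ingested RFPs: {names}"
```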

What the numbers look like

On our internal test batch of 40 federal RFPs:

  • 8,400 total blocks before dedupe.
  • 5,200 unique blocks after dedupe — a 38% compression.
  • Zero false consolidations detected in a manual review of the 200 flagged near-duplicates.

Your mileage depends on document mix. State and local RFPs share less boilerplate than federal. Commercial RFPs share almost nothing. Federal is where dedupe earns its keep.

What changed in the UI

The ingest page now shows a per-document progress row during batch processing. Each row reports parse status, chunk count, and dedupe count. The batch completes when all rows finish or error; errors don’t block the batch.
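A minimal sketch of the per-row state the batch view reports — this assumes nothing about the actual UI code, and the field and status names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DocumentRow:
    filename: str
    parse_status: str   # e.g. "queued" | "parsing" | "done" | "error"
    chunk_count: int = 0
    dedupe_count: int = 0

def batch_complete(rows: list[DocumentRow]) -> bool:
    # The batch finishes when every row has either completed or errored;
    # a single failed document never blocks the others.
    return all(r.parse_status in ("done", "error") for r in rows)
```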

What didn’t ship

We deliberately did not add cross-company dedupe. Blocks from one company’s KB do not get consolidated with another company’s blocks, even if the text is identical. This is a correctness decision — your KB is your KB, and a compliance clause one tenant has approved is not the same artifact as a compliance clause another tenant has approved, even if the characters match.
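One way to picture that boundary: the dedupe key is scoped to the company, so identical text in two tenants' KBs never resolves to the same block. Illustrative only — the post doesn't describe the actual key scheme:

```python
import hashlib

def dedupe_key(company_id: str, normalized_text: str) -> str:
    # The company ID is part of the key, so identical text in two tenants'
    # KBs can never consolidate into a shared block.
    payload = f"{company_id}:{normalized_text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```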

Docs at /platform/kb-builder/ingest are updated with the batch endpoint and the dedupe configuration.

Unbylined posts come from the PursuitAgent team collectively.
