Blog · Tag
rag.
38 posts in this archive.
RAG for past-performance reference selection
How the retriever picks the three best past-performance references out of 180 for a given scope. Not cosine similarity on a paragraph — structured retrieval over multiple facets with a scorer that knows what a good reference looks like.
Retrieval over Slack history: what works, what's too sharp
An experiment with RAG over customer Slack-channel history. Three useful retrieval patterns, two failure modes that led us to gate the feature behind explicit capture flags, and the operational guardrails.
What we learned analyzing 90 days of search logs
Three patterns we did not expect in the KB-search query logs, and one UX change we made because of the findings. Notes from a quarterly log review, written in the build-log spirit.
When two citations disagree: how the draft resolves it
Two KB chunks say different things about the same claim. The conflict-resolution logic that decides which one the drafted answer cites — when to prefer newer, when to prefer higher-authority, and when to refuse.
Observability for drafting: traces, logs, and replays
How we debug a bad draft six weeks after the fact. The three-layer observability stack — request traces, retrieval logs, and deterministic replays — that makes post-hoc drafting issues tractable.
Detecting ungrounded spans in drafts, line by line
A per-sentence classifier that flags which spans in a drafted RFP answer lack source coverage in the retrieved context. What it costs, what it catches, and what it still misses.
Retrieval eval snapshot, December 2025
Q4 retrieval evaluation numbers against our held-out RFP and DDQ corpus. What moved since September, what's still stuck, and which regressions we're not yet fixing.
Tuning pgvector HNSW for proposal workloads
M, ef_construction, ef_search — the three knobs that decide retrieval latency and recall in a pgvector HNSW index. What we chose for PursuitAgent and why.
Our retrieval eval, quarterly report
A quarter of running our retrieval evaluation harness against a frozen gold set: the regressions we caught, the two changes that actually moved precision, and the metric we stopped reporting because it lied.
Cost control for RAG: daily budgets, fallback models, burn alerts
How we keep RAG spend predictable per tenant. Daily budgets, model-tier fallbacks, and burn-rate alerts before the bill spikes — with the dashboard and the rules.
How the draft packet is generated, line by line
The prompt, the retrieval context, and the output template that produce an SME draft packet. A worked example from a real-shaped RFP question to a ready-to-review answer.
The SME draft packet, generated automatically
What we ship to an SME alongside the question so they can answer in five minutes instead of fifty. The packet's components, the retrieval that builds it, and the design choices that keep the SME out of our tool.
Retrieval evaluation, part 2: dealing with numeric claims
Why numeric facts break vanilla retrieval and the two tactics — hybrid search and numeric-claim isolation — that fix it. Continuation of the eval series.
Confidence scores for grounded drafts, explained
What '82% confident' means in our drafting engine, how it's computed from retrieval and entailment signals, and where it leads the reviewer.
Streaming drafts over SSE, with citations inline
How we stream draft output to the browser while keeping citation integrity intact. The architecture, the failure modes, and the part we got wrong twice.
How we curate the retrieval gold set
120 questions, three annotators, a disagreement-resolution protocol. The recipe behind the held-out set we evaluate every retrieval pipeline change against — and the parts we plan to open-source.
Retrieval over diagrams, not just text
How we index D2 code and diagram descriptions so an architecture question can ground to a specific figure. The pipeline, the failure modes, and the citation surface for a diagram source.
The answer provenance graph in the KB
Every block in the knowledge base tracks source, author, approver, and last-used-in. The provenance graph isn't bookkeeping — it's a product surface. Here's what it stores and what it powers.
The reranker that paid for itself
Rerankers add latency and cost. They earn it back when retrieval is borderline and the wrong block in the top-K poisons the draft. Where we run a reranker, where we do not, and the honest tradeoffs.
The cost per response, broken down to the penny
Embedding calls, retrieval compute, draft tokens, verifier tokens, storage. The unit cost structure of a single drafted RFP answer, with a worked example. We publish the unit economics, not customer costs.
Query rewriting for RFP questions with implicit context
Most RFP questions retrieve poorly because they assume context the corpus does not carry. Query rewriting turns 'describe your approach' into a retrieval string that hits. Examples, the rewrite chain, and the cost tradeoff.
The grounded drafting loop, step by step
Retrieve, draft under constraint, verify, emit — or refuse. The four-step loop that produces every drafted answer in PursuitAgent, and the failure mode each step exists to prevent.
The chunk size ablation: 256, 512, 1024 tokens on RFP text
We ran the same retrieval pipeline at three chunk sizes against our RFP-text gold set. Directional results, the tradeoffs that surfaced, and why we don't ship a single global chunk size.
Our eval harness, on the command line
A walkthrough of the dev loop for retrieval changes — one command to baseline, one command to re-run, one to diff. The CLI ergonomics that keep us from tuning by feel.
How we evaluate retrieval quality on our own corpus
Our gold set, the metrics we track, the eval harness on a laptop, the regression-guard CI job, and the directional numbers we'll publicly stand behind. Long read.
The claim-level verification pass, explained
After the draft model writes a sentence, a smaller verifier model reads each substantive claim and asks: is this entailed by the source block? Here's how that works, what it costs, and where it still misses.
Our retrieval latency budget, explained
Where the milliseconds go in a single retrieval call: embedding lookup, vector search, reranker, hybrid merge, payload hydration. P50 120ms, P95 400ms, and what we cut to get there.
Hybrid search: dense embeddings plus BM25 for proposals
Pure dense retrieval misses on numeric identifiers, product names, and SOC codes. Pure BM25 misses on paraphrase. The blend ratio we use, how we tune it, and the test set that catches regressions.
Grounded Retrieval 101, Part 4: what we're still wrong about
The closing post of the Grounded Retrieval 101 series. Three failure modes we have not solved — numeric precision, compound claims, synonym drift — with the test cases that surface them and what we are doing about each.
How the citation rendering stack works
From a retrieval hit to a verify button next to a sentence, in four components. The plumbing behind every cited claim PursuitAgent ships, and why we render the source inline instead of in a footnote.
Grounded Retrieval 101, Part 3: the citation rendering stack
From a verified retrieval hit to an inline citation a reviewer can hover and trust. Four components: citation marker, hover card, source viewer, and audit log.
Testing retrieval: gold sets, precision@k, and why BLEU lies for proposals
Surface-form metrics like BLEU and ROUGE rate proposal text by token overlap. Token overlap is a poor proxy for whether the answer is actually right. Here's the eval stack we use instead.
Grounded retrieval: what it is, what it isn't, what we measure
The canonical long-read on grounded retrieval: the three invariants, the anti-patterns, the eval harness, the four open failure modes, and the research we're running next.
Our chunking pipeline, end to end
Five stages between an uploaded PDF and a retrievable KB block: parse, structural split, semantic rechunk, overlap, and index. Where each one fails and why we kept the boundaries.
Grounded Retrieval 101, Part 2: why citations don't guarantee groundedness
A citation tells you which passage was retrieved. It does not tell you whether the cited passage actually supports the generated claim. Part 2 of the Grounded Retrieval series — the entailment gap, and what closes it.
Grounded Retrieval 101, Part 1: what RAG is and why it still hallucinates
RAG in three sentences, then the hard part: why retrieval-augmented generation still produces fabricated answers, and what the academic and practitioner literature says about it. Part 1 of a four-part series.
How we chunk proposals for retrieval
Fixed-window chunking loses at headers, table cells, and numeric clauses. This post walks through the structural-plus-semantic chunking strategy we run on past proposals and KB content blocks, with code.
How the Grounded-AI Pledge is enforced in code
The Pledge says every drafted answer links to a source in your KB. Here's how the drafting engine enforces that — with refusals, not with model hygiene.
See the proposal workflow
Take the 5-minute tour, then start a trial workspace when you're ready to run a real pursuit against your own source material.