RAG for past-performance reference selection
How the retriever picks the three best past-performance references out of 180 for a given scope. Not cosine similarity on a paragraph — structured retrieval over multiple facets with a scorer that knows what a good reference looks like.
A federal RFP asks for three past-performance references relevant to “cloud modernization of legacy HR systems for a civilian agency, $10-30M contract value, within the last 5 years.” The team’s past-performance library has 180 entries. Which three?
This is the retrieval problem we solved badly for the first year and have been fixing iteratively since. This post is how it currently works.
Why the obvious approach doesn’t work
The obvious approach: embed each past-performance entry as a paragraph summary, embed the RFP scope-of-work text, run cosine similarity, take the top three. We built that first. It retrieved terrible references.
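For concreteness, that first version was essentially the following (a minimal sketch; the names are illustrative, not our production code):

type NaiveEntry = {
  id: string;
  summaryEmbedding: number[]; // one embedding of the whole writeup
};

// Plain cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every entry by similarity to the scope text and take the top three.
function naiveTopThree(scopeEmbedding: number[], entries: NaiveEntry[]): NaiveEntry[] {
  return entries
    .map((entry) => ({ entry, score: cosine(scopeEmbedding, entry.summaryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3)
    .map(({ entry }) => entry);
}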
Two failure modes dominated.
Facet collapse. A single embedding of a past-performance entry flattens every facet — customer type, contract vehicle, dollar value, technology stack, outcome metrics — into one vector. The cosine similarity is dominated by whichever facet happens to have the most text. An entry with a long technology-stack description gets high similarity against any RFP that lists technology keywords, even when the customer type or contract size is wildly off.
Recency invisibility. Embeddings don’t care about dates. A relevant engagement from 2019 scores the same as an irrelevant engagement from 2024 if the text is similar. Federal past-performance criteria almost always have a recency constraint, and the retriever was blind to it.
The structured approach
We replaced the single embedding with a facet-decomposed retrieval. Each past-performance entry is stored as:
type PastPerformanceEntry = {
  id: string;
  narrativeEmbedding: number[]; // free-text summary
  facets: {
    customer: {
      type: "federal-civilian" | "federal-defense" | "state" | "commercial" | "...";
      named: string; // "USDA"
      industry?: string;
    };
    contract: {
      vehicleType: "standalone" | "idiq-task-order" | "gsa-schedule" | "...";
      value: { min: number; max: number };
      startDate: Date;
      endDate: Date | "ongoing";
    };
    technical: {
      stackTags: string[]; // ["aws", "oracle-hcm", "java"]
      domains: string[]; // ["hr-systems", "cloud-migration"]
      stackEmbedding: number[]; // embedding of the tech/domain text
    };
    outcomes: {
      metrics: string[]; // ["reduced IT ticket volume 34%", ...]
      outcomeEmbedding: number[]; // embedding of outcome narrative
    };
  };
};
Each RFP scope-of-work section is parsed into a query with matching facets:
type ScopeQuery = {
  narrativeEmbedding: number[];
  filters: {
    customerTypeIn?: string[];
    industryIn?: string[];
    vehicleTypeIn?: string[];
    valueRange?: { min: number; max: number };
    startedAfter?: Date;
    endedAfter?: Date;
  };
  ranking: {
    domainEmbedding: number[];
    outcomeEmbedding: number[];
  };
};
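As a hypothetical example, the scope from the intro would parse into something like this. The embed() helper and the field values are stand-ins for illustration, not what our parser emits:

// Hypothetical parse of the intro's scope: civilian agency, $10-30M, last 5 years.
declare function embed(text: string): number[]; // stand-in for the embedding call

const fiveYearsAgo = new Date();
fiveYearsAgo.setFullYear(fiveYearsAgo.getFullYear() - 5);

const scopeQuery: ScopeQuery = {
  narrativeEmbedding: embed("cloud modernization of legacy HR systems for a civilian agency"),
  filters: {
    customerTypeIn: ["federal-civilian"],
    valueRange: { min: 10_000_000, max: 30_000_000 },
    endedAfter: fiveYearsAgo, // "within the last 5 years"
  },
  ranking: {
    domainEmbedding: embed("cloud migration, legacy HR systems modernization"),
    outcomeEmbedding: embed("completed migration, cost and ticket-volume reductions"),
  },
};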
The scoring function
Retrieval runs in three passes.
Pass 1 — Hard filters. Apply the filters from the query. An RFP requiring federal-civilian customers, last-5-years, and $10-30M value range filters down from 180 entries to typically 20-40. This pass is a SQL-level filter, not a vector operation.
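A sketch of that pass, assuming the facets live in a relational table; the table and column names are made up, but the shape of the predicate is the point:

// Illustrative hard-filter query. The real schema differs; what matters is
// that this pass is a plain WHERE clause over facet columns, not a vector search.
const hardFilterSql = `
  SELECT id
  FROM past_performance
  WHERE customer_type = ANY($1)               -- e.g. {'federal-civilian'}
    AND contract_value_max >= $2              -- scope minimum, e.g. 10,000,000
    AND contract_value_min <= $3              -- scope maximum, e.g. 30,000,000
    AND (end_date >= $4 OR end_date IS NULL)  -- recency window, or still ongoing
`;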
Pass 2 — Facet-level similarity. For each surviving entry, compute three similarities:
- Narrative similarity (entry’s free-text summary vs. scope narrative).
- Technical similarity (entry’s tech/domain embedding vs. scope’s domain embedding).
- Outcome similarity (entry’s outcome embedding vs. scope’s outcome embedding).
We found outcome similarity to be the most predictive of “would a reviewer actually find this a good reference?” Outcomes are what buyers scan past-performance writeups for. We weight the three similarities: narrative 0.2, technical 0.3, outcome 0.5.
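In code, pass 2 is just a weighted sum of the three cosine similarities. A sketch, with cosine() declared as the same dot-product-over-norms helper from the naive version:

// Pass 2: facet-level similarity, weighted 0.2 / 0.3 / 0.5 as above.
declare function cosine(a: number[], b: number[]): number; // same helper as the naive sketch

const FACET_WEIGHTS = { narrative: 0.2, technical: 0.3, outcome: 0.5 };

function facetScore(entry: PastPerformanceEntry, query: ScopeQuery): number {
  const narrative = cosine(entry.narrativeEmbedding, query.narrativeEmbedding);
  const technical = cosine(entry.facets.technical.stackEmbedding, query.ranking.domainEmbedding);
  const outcome = cosine(entry.facets.outcomes.outcomeEmbedding, query.ranking.outcomeEmbedding);
  return (
    FACET_WEIGHTS.narrative * narrative +
    FACET_WEIGHTS.technical * technical +
    FACET_WEIGHTS.outcome * outcome
  );
}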
Pass 3 — Ranking heuristics. On top of the facet similarity score, apply two ranking bumps:
- Recency bump: entries completed in the last 18 months get a +0.1 score bonus. Entries completed 3-5 years ago get no bonus; entries older than 5 years (but within the hard-filter window) get -0.05.
- Customer specificity bump: entries where the customer’s specific agency matches (USDA, not just “federal-civilian”) get +0.08.
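A sketch of pass 3. Two details here are our simplifications for illustration: ongoing engagements are treated as just completed, and the unstated 18-month-to-3-year band gets no bonus. The rfpNamedAgency parameter is also hypothetical; in practice the named customer comes from the parsed scope.

// Pass 3: ranking bumps, applied on top of facetScore() before the final sort.
function monthsBetween(a: Date, b: Date): number {
  return (b.getFullYear() - a.getFullYear()) * 12 + (b.getMonth() - a.getMonth());
}

function rankingBumps(entry: PastPerformanceEntry, rfpNamedAgency?: string): number {
  let bump = 0;

  // Recency bump. Ongoing engagements are treated as just completed here.
  const end = entry.facets.contract.endDate;
  const monthsAgo = end === "ongoing" ? 0 : monthsBetween(end, new Date());
  if (monthsAgo <= 18) bump += 0.1;       // completed in the last 18 months
  else if (monthsAgo > 60) bump -= 0.05;  // older than 5 years but inside the hard-filter window

  // Customer specificity bump: the named agency matches, not just the customer type.
  if (rfpNamedAgency && entry.facets.customer.named === rfpNamedAgency) {
    bump += 0.08;
  }

  return bump;
}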
The top three results are returned with their score components visible, so the proposal writer can see why the retriever picked each one and swap any of them out if their editorial judgment disagrees.
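The returned shape looks roughly like this (field names illustrative):

// What the writer sees per returned reference: the total plus its components.
type ScoredReference = {
  entryId: string;
  total: number;
  components: {
    narrative: number;        // weighted narrative similarity
    technical: number;        // weighted tech/domain similarity
    outcome: number;          // weighted outcome similarity
    recencyBump: number;      // +0.1 / 0 / -0.05
    specificityBump: number;  // +0.08 or 0
  };
};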
What we measured
We ran an eval harness against a golden set of 40 RFP scopes, each labeled by a proposal veteran with “ideal top three” references. Metrics:
- Naive single-embedding retrieval: top-3 recall against the golden set was 31%.
- Facet-decomposed retrieval without ranking bumps: 58%.
- Current retrieval with ranking bumps: 72%.
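Top-3 recall here means, roughly: for each scope, the fraction of the veteran's ideal three that appears in the retriever's top three, averaged over the 40 scopes. A sketch of that computation (golden-set loading elided):

// Each case pairs the retriever's output with the labeled ideal references for one scope.
type EvalCase = { retrievedIds: string[]; idealIds: string[] };

function topThreeRecall(cases: EvalCase[]): number {
  const perCase = cases.map(({ retrievedIds, idealIds }) => {
    const top3 = retrievedIds.slice(0, 3);
    const hits = idealIds.filter((id) => top3.includes(id)).length;
    return hits / idealIds.length;
  });
  return perCase.reduce((sum, r) => sum + r, 0) / perCase.length;
}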
The gap from 72% to 100% is where the veteran’s judgment incorporates things the retriever cannot easily encode — relationship context, strategic fit with the buyer’s next-phase plans, political considerations. We don’t try to automate those. The retriever is a first-pass; the proposal writer adjusts.
What breaks
Three failure modes we have not fully fixed.
Ambiguous scopes. An RFP that lists twelve technical domains with equal emphasis produces a domain embedding that averages out to something generic. The retriever then favors whatever entry has the longest domain text. The fix on the roadmap is a scope-weighting step that identifies which domains the RFP actually emphasizes (based on page count, section weighting, rubric mentions) rather than treating all domains equally.
Small KBs. With fewer than 30 past-performance entries total, the hard-filter pass can drop to zero surviving candidates. We now fall back to facet-ranked retrieval across all entries when the filter pass produces fewer than three. It works; it’s not ideal.
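The fallback is a few lines; a sketch, with applyHardFilters() standing in for the SQL pass from earlier:

// Small-KB fallback: if hard filtering leaves fewer than three candidates,
// rank every entry by facet score instead of the filtered subset.
declare function applyHardFilters(
  entries: PastPerformanceEntry[],
  filters: ScopeQuery["filters"]
): PastPerformanceEntry[]; // stand-in for the SQL pass

function candidatesForRanking(
  all: PastPerformanceEntry[],
  query: ScopeQuery
): PastPerformanceEntry[] {
  const filtered = applyHardFilters(all, query.filters);
  return filtered.length >= 3 ? filtered : all;
}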
Outcome-metric quality variation. Entries written by different proposal managers describe outcomes with different levels of specificity. Some say “improved efficiency”; others say “reduced IT ticket volume 34%.” The outcome embedding rewards specificity, which means newer (better-written) entries get favored over older ones regardless of relevance. We surface this in review mode — the writer sees the outcome text and decides whether to override.
Where this sits in the draft pipeline
Past-performance retrieval runs early in the draft pipeline. The retrieved references become inputs to the past-performance section generator, which produces a structured writeup per reference. See past performance in three sentences and past performance that actually maps for how the writeups are structured.
The retriever itself is one of our newest grounded-retrieval components. It builds on the grounded retrieval 101 foundations and uses the embedding model selection decisions from last fall.
Takeaway
Cosine similarity on a single embedding is the wrong algorithm for past-performance retrieval. The facets that matter — customer type, contract value, recency, outcome metrics — are structured fields the writer already fills in at ingest. A retriever that respects those fields, weights the text similarities by facet, and applies recency and specificity bumps takes recall from roughly 30% to over 70%. That is the difference between a retriever the writer uses and a retriever the writer overrides every time.
Posts bylined to “The PursuitAgent engineering team” are written by the people building the product.