Field notes

Query understanding, a year on: where the model won

A year of hand-written query-rewrite rules versus LLM-based query rewriting on RFP questions. Which side won, where the hand-written rules still beat the model, and what the hybrid looks like now.

A year ago we had a long argument about query understanding. One side wanted hand-written rules — a deterministic rewrite layer that mapped known patterns of RFP questions to known retrieval strategies. The other side wanted an LLM query rewriter that could handle the long tail without someone writing a rule for every case. We ran both in parallel for the year. This post is what we learned.

The short version: the LLM won most of the time, the hand-written rules still beat it on specific patterns, and the production system now runs a hybrid where each side handles what it is demonstrably better at.

The setup

The retrieval problem for RFP questions is not the retrieval problem for web search. Questions in RFPs are often multi-part (“describe your approach to X, including A, B, and C, with specific reference to Y”), domain-loaded (specific regulatory citations, specific technical acronyms, specific customer-side terms of art), and structurally ambiguous (a single question that touches three different parts of the KB). A naive embedding of the full question retrieves plausible blocks that are not the best match for any of the sub-parts.

We had seen the problem in the practitioner record — Responsive reviews flag exactly this failure mode: keyword-match search on RFP questions surfaces “completely unrelated” blocks because the question is too compound for a single retrieval. The Stanford HAI paper on legal RAG identifies the same pattern in a different domain.

The two approaches we ran in parallel:

Hand-written rules. A deterministic rewriter with roughly 60 pattern matchers. Each matcher handled a specific question shape — “describe your approach to (X) including (A, B, C)” would split into three sub-queries, each with a boosted keyword based on the sub-part. Rules were fast (sub-millisecond), predictable, and easy to debug. They covered about 70% of questions in our production mix.
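A minimal sketch of what one such matcher can look like, assuming a regex-based implementation; the pattern, boost scheme, and names are illustrative, not the production rule set:

```python
import re

# Illustrative matcher for the shape
# "describe your approach to X, including A, B, and C".
APPROACH_PATTERN = re.compile(
    r"describe your approach to (?P<topic>.+?),?\s+including\s+(?P<parts>.+)",
    re.IGNORECASE,
)

def split_approach_question(question: str) -> list[dict]:
    """Split a compound 'approach' question into one sub-query per sub-part."""
    match = APPROACH_PATTERN.search(question)
    if match is None:
        return []  # not this shape; the next matcher in the rule set tries
    topic = match.group("topic").strip()
    # Split "A, B, and C" into ["A", "B", "C"].
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("parts").rstrip("."))
    return [
        {"query": f"{topic} {part.strip()}", "boost": part.strip()}
        for part in parts
        if part.strip()
    ]

# split_approach_question("Describe your approach to data security, "
#                         "including encryption, key management, and auditing")
# -> [{"query": "data security encryption", "boost": "encryption"}, ...]
```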

LLM query rewriter. A small model prompted to take an RFP question and return 1-3 rewritten queries, each annotated with intent. The rewriter had no explicit rule set — it relied on in-context examples and the model’s general understanding of question structure. It was slower (50-200ms), non-deterministic, and occasionally made confident mistakes that a rule would not have made.
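For comparison, the shape of the LLM path. The prompt here is hypothetical and `call_model` is a stand-in for whatever client sends a prompt to the model; the production prompt and in-context examples are not reproduced:

```python
import json
from typing import Callable

# Hypothetical prompt; the production version carries in-context examples.
REWRITE_PROMPT = """You rewrite RFP questions into retrieval queries.
Return a JSON list of 1-3 objects, each with "query" and "intent" fields.

Question: {question}
"""

def llm_rewrite(question: str, call_model: Callable[[str], str]) -> list[dict]:
    """Rewrite an RFP question into 1-3 intent-annotated retrieval queries."""
    raw = call_model(REWRITE_PROMPT.format(question=question))
    try:
        rewrites = json.loads(raw)
    except json.JSONDecodeError:
        # The rewriter is non-deterministic; fall back to the raw question
        # rather than failing the retrieval outright.
        return [{"query": question, "intent": "fallback"}]
    return rewrites[:3]
```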

What happened

A year in, aggregated across the production evaluation set, the LLM rewriter produced higher-quality retrievals than the rule set on 73% of questions. The rule set produced higher-quality retrievals on 19% of questions. Tie on 8%.

The 73% is not the interesting number. The interesting number is the 19%.

Where the rules still won

Three patterns where the hand-written rules beat the LLM rewriter consistently.

Pattern one: highly-specified regulatory citations. Questions of the form “describe your compliance with FAR 52.204-21” need the exact citation preserved in the rewrite. The LLM rewriter would sometimes paraphrase the citation — “your approach to basic safeguarding controls” — which retrieved related but less precise blocks. The hand-written rule preserved the citation verbatim and anchored retrieval on a block that actually referenced FAR 52.204-21 by number.
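A sketch of the discipline the rule enforces, assuming FAR-style citations; the regex covers the common FAR numbering shape and the names are illustrative:

```python
import re

# Matches FAR-style citations such as "FAR 52.204-21". The production rule
# set covers more citation families than this one pattern.
FAR_CITATION = re.compile(r"\bFAR\s+\d+\.\d+-\d+\b", re.IGNORECASE)

def rewrite_regulatory(question: str) -> dict | None:
    """Keep a cited regulation verbatim and anchor retrieval on it."""
    match = FAR_CITATION.search(question)
    if match is None:
        return None  # not a regulatory-citation question; fall through
    return {"query": question, "must_match": match.group(0)}
```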

Pattern two: multi-part compound questions with explicit sub-structure. Questions that literally list “(a), (b), (c)” as sub-parts do better with deterministic splitting than with model-inferred splitting. The LLM rewriter produced reasonable splits, but the rule produced exact splits, and exactness on the structural frame mattered.
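The deterministic split is almost trivially simple, which is the point: the structural frame is already in the question, so nothing needs to be inferred. A sketch:

```python
import re

# Explicit "(a)", "(b)", "(c)" markers carry the structure; split on them.
SUBPART_MARKER = re.compile(r"\(\s*[a-z]\s*\)", re.IGNORECASE)

def split_lettered_subparts(question: str) -> list[str]:
    """One sub-query per explicitly lettered sub-part, stem preserved in each."""
    pieces = SUBPART_MARKER.split(question)
    stem = pieces[0].strip(" :,")
    subparts = [p.strip(" ;,.") for p in pieces[1:] if p.strip(" ;,.")]
    if not subparts:
        return [question]  # no explicit sub-structure; pass through
    return [f"{stem} {part}" for part in subparts]
```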

Pattern three: template questions in recurring DDQs (due diligence questionnaires). The 40% of DDQ questions that recur across every security questionnaire (“describe your data retention policy,” “describe your incident response procedure”) are better served by a direct template match than by a rewrite. The rule-based path recognizes the template and retrieves the canonical answer block directly. The LLM rewriter would re-ask the same question in three variations and retrieve three near-duplicate blocks.
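The template path is a normalized lookup rather than a rewrite; a minimal sketch, with a hypothetical canonical-answer table:

```python
import re

# Hypothetical canonical-answer table keyed on normalized question text.
CANONICAL_BLOCKS = {
    "describe your data retention policy": "block:data-retention-v3",
    "describe your incident response procedure": "block:incident-response-v5",
}

def normalize(question: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace so variants collide."""
    text = re.sub(r"[^a-z0-9 ]", "", question.lower())
    return re.sub(r"\s+", " ", text).strip()

def template_match(question: str) -> str | None:
    """Return the canonical answer block directly, bypassing the rewriter."""
    return CANONICAL_BLOCKS.get(normalize(question))
```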

In each of these three patterns, the hand-written rule is not smarter than the model. It is more disciplined. The rule does exactly the thing that the evaluation shows is right for the pattern, without the model’s tendency to find a slightly different formulation of the same thing.

The hybrid, today

Production now routes queries based on a fast classifier that sorts the question into one of four buckets:

  1. Regulatory-citation bucket → hand-written rule path.
  2. Compound-structure bucket → hand-written rule path.
  3. Template-match bucket → direct template retrieval, bypassing the rewriter entirely.
  4. Everything else → LLM rewriter.

Buckets 1-3 cover about 35% of production questions. Bucket 4 handles the rest. The classifier is a small model (not the rewriter itself) trained on labeled historical data.
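End to end, the routing layer is small. A sketch of its shape, with `classify` standing in for the trained classifier and the path functions standing in for the sketches above; all names are illustrative:

```python
from enum import Enum
from typing import Callable

class Bucket(Enum):
    REGULATORY = "regulatory-citation"
    COMPOUND = "compound-structure"
    TEMPLATE = "template-match"
    DEFAULT = "everything-else"

def route(
    question: str,
    classify: Callable[[str], Bucket],       # the small trained classifier
    rule_rewrite: Callable[[str], list],     # deterministic rule path
    template_lookup: Callable[[str], str | None],
    llm_rewrite: Callable[[str], list],
) -> list:
    """Send a question down the path the classifier picks."""
    bucket = classify(question)
    if bucket in (Bucket.REGULATORY, Bucket.COMPOUND):
        return rule_rewrite(question)
    if bucket == Bucket.TEMPLATE:
        block = template_lookup(question)
        if block is not None:
            return [block]  # bypass the rewriter entirely
    return llm_rewrite(question)  # everything else: LLM path
```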

The hybrid is materially better than either approach alone. Retrieval quality is 11% higher than the LLM-only system and 34% higher than the rule-only system on the production evaluation set. Latency is lower than LLM-only because a third of queries skip the rewriter. Cost is lower for the same reason. The engineering cost of maintaining 60 hand-written rules is real but not growing — new rules are added roughly monthly, old rules are retired as the LLM improves, and the long-tail cases the rules used to handle have mostly migrated to the model’s turf.

What this means for the next year

Three things we expect to change in year two.

The LLM’s share will grow. Model capability keeps improving. Today the rules beat the model on 19% of questions. In 12 months the number will probably be closer to 10%. Some of the regulatory-citation handling can move to the model as context windows and instruction-following reliability improve. The compound-structure handling will take longer; the template-match handling may not move at all, because templates are cheap and fast and nothing structurally needs to beat them.

The classifier becomes the leverage point. The hybrid is only as good as the classifier that routes questions to the right path. Mistakes in the classifier are the dominant source of retrieval failures now. We will spend more of next year on the classifier than on either of the underlying paths.

Hand-written rules will remain a category. There is a school of thought in ML that says hand-written rules are legacy, and the endgame is all model. We do not believe that. For specific narrow patterns where the rule encodes a structural constraint the model does not reliably honor, the rule is not legacy — it is a different tool for a different job. The hybrid search post makes the same point about dense and sparse retrieval. Not every retrieval problem is a model problem.

The practical takeaway

If you are building RFP retrieval, start with an LLM rewriter — it covers the common case. Add rules for the narrow patterns where you can prove the rules win. Route with a classifier. Measure retrieval quality on a held-out set, not on LLM self-judgment. Accept that the mix will shift toward the model over time, and treat the rules as a pragmatic tool rather than a dogmatic position.
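A minimal sketch of what measuring on a held-out set can look like, assuming recall against human-labeled relevant blocks; every name here is hypothetical:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of labeled relevant blocks found in the top-k retrieval."""
    if not relevant:
        return 0.0
    return sum(1 for block in retrieved[:k] if block in relevant) / len(relevant)

def evaluate(system, held_out: list[dict], k: int = 5) -> float:
    """Average recall@k over held-out (question, labeled-blocks) pairs.

    `system` is any callable mapping a question to ranked block IDs;
    each held-out item is {"question": str, "relevant": set[str]}.
    """
    scores = [
        recall_at_k(system(item["question"]), item["relevant"], k)
        for item in held_out
    ]
    return sum(scores) / len(scores)
```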

The argument we had a year ago was the wrong argument. It was not “rules vs. model.” It was “which tool for which pattern.” A year of running both in production made the answer concrete enough that we stopped arguing about it.

Sources

  1. Stanford HAI — Legal RAG hallucinations
  2. G2 — Responsive (formerly RFPIO) reviews