New models, quarterly eval: Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro
An internal eval across three current-generation models for our specific workloads — drafting, claim verification, extraction. What moved, where we switched defaults, and why one workload still sits on a year-old model.
Three current-generation models landed across the providers we track between late 2025 and Q1 2026: Anthropic’s Claude Sonnet 4.6 (Feb 2026), OpenAI’s GPT-5.2, and Google’s Gemini 3.1 Pro (Feb 2026). This post covers what the internal eval looked like, which workloads moved, and which one stayed on a year-old model on purpose.
Numbers below are directional. They are produced by our own harness on our own workloads; they are not a general benchmark, and we would not expect them to generalize outside a grounded-RAG proposal context. The purpose of the post is to show how we think about the swap, not to publish a leaderboard.
The harness and the workloads
The eval harness is the same one we’ve run since 2025 (CLI reference), against the same golden set described here. Four workloads:
- Draft generation — write a grounded answer from retrieved KB blocks.
- Claim verification — entail or reject a specific numeric/named-entity claim against a source block.
- Compliance extraction — extract requirement rows from RFP text.
- Question rewriting — rewrite a buyer’s question into 3–5 retrieval queries.
For each workload, we measure a task-specific quality score, end-to-end latency, and cost per request.
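For orientation, here is a minimal sketch of the shape the harness reduces each configuration to: per-request records carrying the task-specific score, latency, and cost, aggregated into the rows in the tables below. The field names and the aggregation helper are illustrative, not our actual schema.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestResult:
    workload: str      # "draft", "verify", "extract", or "rewrite"
    model: str         # provider model id (illustrative naming)
    quality: float     # task-specific score; scale varies by workload
    latency_s: float   # end-to-end wall clock for the request
    cost_usd: float    # provider-reported cost for the request

def summarize(results: list[RequestResult]) -> dict:
    """Reduce one (workload, model) configuration to a table row."""
    latencies = sorted(r.latency_s for r in results)
    # P95 from 20 quantile cut points (5% steps); fall back for tiny samples.
    p95 = quantiles(latencies, n=20)[18] if len(latencies) >= 2 else latencies[0]
    return {
        "quality_mean": sum(r.quality for r in results) / len(results),
        "latency_p95_s": p95,
        "cost_mean_usd": sum(r.cost_usd for r in results) / len(results),
    }
```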
Results
Scores are directional. Each row is one configuration we ran; deltas within a column are more meaningful than absolute values.
Draft generation
| Model | Quality (0–100, internal) | P95 latency | Cost per draft |
|---|---|---|---|
| Claude Sonnet 4.5 (prior default) | 89.4 | 14.2s | $0.041 |
| Claude Sonnet 4.6 | 91.1 | 13.1s | $0.039 |
| GPT-5.2 | 90.8 | 11.4s | $0.036 |
| Gemini 3.1 Pro | 87.2 | 9.8s | $0.028 |
Decision: switched default to Claude Sonnet 4.6. Quality delta from 4.5 is small but consistent, and citation-anchor adherence (a sub-score that measures whether claims resolve to the cited block) improved materially. GPT-5.2 is a close second and we’re keeping it wired as the fallback; Gemini 3.1 Pro’s lower quality score came almost entirely from citation-anchor drift, which matters more for us than the latency win.
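For context on that sub-score: citation-anchor adherence asks, for each claim sentence in a draft, whether the KB block the draft cites actually contains support for it. A toy sketch of the idea is below; the Claim shape and the word-overlap heuristic are illustrative stand-ins (the real check uses an entailment pass), not our production code.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str       # one claim sentence from the generated draft
    anchor_id: str  # KB block id the draft cites for this claim

def anchor_adherence(claims: list[Claim], blocks: dict[str, str]) -> float:
    """Fraction of claims whose cited block plausibly supports them.

    Toy heuristic: the cited block must exist in the retrieved set and share
    enough content words with the claim. The real sub-score is stricter.
    """
    def content_words(s: str) -> set[str]:
        return {w for w in s.lower().split() if len(w) > 3}

    supported = 0
    for c in claims:
        block = blocks.get(c.anchor_id)
        if block is None:
            continue  # dangling anchor: cited block is not in the retrieved set
        overlap = content_words(c.text) & content_words(block)
        if len(overlap) >= max(2, len(content_words(c.text)) // 3):
            supported += 1
    return supported / len(claims) if claims else 1.0
```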
Claim verification
| Model | Precision | Recall | F1 | Cost per claim |
|---|---|---|---|---|
| Claude Sonnet 4.5 (prior default) | 0.962 | 0.931 | 0.946 | $0.003 |
| Claude Sonnet 4.6 | 0.974 | 0.928 | 0.950 | $0.003 |
| GPT-5.2 | 0.968 | 0.942 | 0.955 | $0.003 |
| Gemini 3.1 Pro | 0.971 | 0.918 | 0.944 | $0.002 |
Decision: switched default to GPT-5.2 for claim verification. This is the workload where recall matters more than raw precision — a missed unverifiable claim is worse than a verified claim flagged for re-check. GPT-5.2’s recall advantage is the biggest delta in the eval. Claude Sonnet 4.6 runs as the fallback.
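One way to make “recall matters more” concrete is to re-score the table with an F-beta that up-weights recall (β = 2 counts recall four times as heavily as precision). This is purely illustrative; it is not the metric we report above, but it shows why the GPT-5.2 row wins once recall dominates.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision/recall rows from the claim-verification table above.
rows = {
    "Claude Sonnet 4.5": (0.962, 0.931),
    "Claude Sonnet 4.6": (0.974, 0.928),
    "GPT-5.2":           (0.968, 0.942),
    "Gemini 3.1 Pro":    (0.971, 0.918),
}

for name, (p, r) in rows.items():
    print(f"{name:<18} F1={f_beta(p, r, 1.0):.3f}  F2={f_beta(p, r, 2.0):.3f}")
# GPT-5.2's lead widens under F2 (~0.947 vs ~0.937 for Sonnet 4.6),
# which is the shape of the argument for making it the default here.
```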
Compliance extraction
| Model | Precision | Recall | Cost per RFP |
|---|---|---|---|
| Our fine-tuned BERT classifier (prior default) | 0.86 | 0.94 | $0.11 |
| Claude Sonnet 4.6 (zero-shot) | 0.81 | 0.96 | $2.40 |
| GPT-5.2 (zero-shot) | 0.79 | 0.97 | $2.10 |
| Gemini 3.1 Pro (zero-shot) | 0.77 | 0.95 | $1.40 |
Decision: stayed on the fine-tuned BERT. The grammar-based extraction we described in our compliance-extraction revisit still wins on precision and is roughly 13–22x cheaper per RFP. LLMs have slightly better recall on long-tail phrasing, but they also hallucinate requirements that aren’t in the document at all. For this workload, a small task-specific classifier beats a generalist.
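The hallucination failure mode is at least partly mechanical to screen for: an extracted requirement that cannot be found anywhere in the source RFP is suspect by construction. Below is a rough sketch of that kind of grounding filter, assuming requirements come back as plain strings; it is an illustration, not a description of our extraction pipeline.

```python
import re

def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so span matching is not layout-sensitive."""
    return re.sub(r"\s+", " ", s.lower()).strip()

def ungrounded_requirements(extracted: list[str], rfp_text: str) -> list[str]:
    """Return extracted requirement strings that never appear in the source RFP.

    A literal-substring check is crude (paraphrase gets flagged as 'ungrounded'),
    but anything it flags deserves a human look before it lands in the matrix.
    """
    haystack = _normalize(rfp_text)
    return [req for req in extracted if _normalize(req) not in haystack]
```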
Question rewriting
| Model | Retrieval MRR@10 | Latency P95 | Cost per rewrite |
|---|---|---|---|
| Claude Haiku 4.5 (prior default) | 0.71 | 780ms | $0.0004 |
| Claude Sonnet 4.6 | 0.74 | 1,420ms | $0.002 |
| GPT-5.2 mini | 0.73 | 890ms | $0.0006 |
| Gemini 3 Flash | 0.72 | 640ms | $0.0003 |
Decision: stayed on Claude Haiku 4.5. The 3-point MRR@10 delta to Sonnet 4.6 doesn’t justify a 5x cost increase and a 1.8x latency hit on a call that runs on every retrieval. Haiku 4.5 is fine for this workload, and there’s no reason to move it.
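For reference, MRR@10 in that table is the mean, over eval questions, of the reciprocal rank of the first relevant block in the top 10 retrieved results, contributing 0 when nothing relevant appears. A minimal sketch of the computation, assuming gold block ids per query:

```python
def mrr_at_10(ranked_ids: list[list[str]], gold_ids: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant block within the top 10 results."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        for rank, block_id in enumerate(ranked[:10], start=1):
            if block_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)
```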
Year-over-year view
Compared to the May 2025 eval, two things changed structurally:
Quality converged. Twelve months ago the spread between the best and worst model on draft generation was about 8 points. Today it’s 4. Choosing a model is now much more about workload fit than about finding a “better” model — most current-gen models are roughly comparable on the easy workloads and differentiated on the hard ones.
Cost dropped materially. Composite cost across our four workloads is down meaningfully since last May. This is the tailwind that let us hold our Team-tier pricing flat (covered in the year-one pricing post) while the underlying model spend fell.
What we didn’t test
We did not test open-weight models for this cycle. Our production stack requires provider-managed routing, data residency commitments, and the citation-anchor features we built on top of specific provider APIs. An open-weight evaluation is a real piece of work and we plan to run one in Q3 2026 on the verification workload specifically — it’s the one where self-hosted inference could plausibly win on cost.
We did not test reasoning modes on any model. The four workloads are all latency-sensitive; none of them tolerate the extra 30–120 seconds a reasoning pass adds. For one-off analysis tasks (an RFP teardown, say) we do use reasoning modes ad hoc, but they aren’t in the production critical path.
The takeaway
Two default swaps (drafting and verification), one non-change (extraction stays on fine-tuned BERT), one non-upgrade (question rewriting stays on Haiku 4.5). The eval drives the decisions; our job is to keep the harness honest. Numbers are ours, on our workloads — reproducing them requires our corpus, which we do not publish.