Field notes

The system-health check at year one

Four dashboards we watch every morning. What each one caught this quarter — and the one that nearly missed a regression.

Every morning at 9am one of us goes through four dashboards. This is the discipline that has caught the regressions that our automated alerting didn’t. A year in, here’s what each dashboard is, what it caught this quarter, and which one nearly let something through.

Dashboard 1 — Retrieval recall against golden set

A golden set of 1,200 question-answer pairs, pulled from historical proposals with human-labeled expected sources. The retrieval system’s top-3 recall on this set is measured nightly. The dashboard shows the last 90 days.
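For orientation, a minimal sketch of how a nightly top-3 recall number like this gets computed. The `retrieve` callable and the `question` / `expected_source_ids` fields are illustrative names, not our actual schema; the dashboard is just 90 of these nightly numbers in a row.

```python
from typing import Callable, Iterable


def top_k_recall(golden_set: Iterable[dict],
                 retrieve: Callable[[str, int], list[str]],
                 k: int = 3) -> float:
    """Fraction of golden questions for which at least one human-labeled
    expected source appears among the top-k retrieved source ids."""
    hits, total = 0, 0
    for item in golden_set:
        retrieved = retrieve(item["question"], k)
        # Count a hit if any expected source was retrieved.
        if set(item["expected_source_ids"]) & set(retrieved):
            hits += 1
        total += 1
    return hits / total if total else 0.0
```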

What it caught this quarter: a 3-point recall drop the morning after a reranker model update. The automated alert fires at a 5-point drop; a 3-point drop is within noise for any single day but a concerning trend over five consecutive days. The morning review caught the trend on day four. Root cause: the reranker update had a subtly different tokenizer that affected certain category names. Fix: roll back, investigate, re-evaluate on the golden set before re-rolling.

See our retrieval evaluation pillar for how this set is maintained.

Dashboard 2 — P95 draft latency by tenant

The retrieval latency budget post has the target numbers. The dashboard shows rolling P50 and P95 for draft generation, split by tenant-size tier. Tenant size matters because large-KB tenants have different retrieval cost profiles than small ones.
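As a sketch of the cohort split (the tier boundaries and field names below are illustrative, not our production buckets):

```python
import statistics
from collections import defaultdict


def tenant_tier(block_count: int) -> str:
    # Illustrative boundaries; the >50k tier is the one discussed below.
    if block_count > 50_000:
        return "large"
    if block_count > 5_000:
        return "medium"
    return "small"


def p95_by_tier(requests: list[dict]) -> dict[str, float]:
    """requests: [{"tenant_blocks": int, "latency_s": float}, ...]"""
    by_tier = defaultdict(list)
    for r in requests:
        by_tier[tenant_tier(r["tenant_blocks"])].append(r["latency_s"])
    # quantiles(n=20)[-1] is the 95th-percentile cut point.
    return {tier: statistics.quantiles(vals, n=20)[-1]
            for tier, vals in by_tier.items() if len(vals) >= 20}
```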

What it caught this quarter: a long-tail P95 spike specifically on tenants with >50k blocks. The P50 looked fine. The P95 was 42s vs. our 28s target. The alerting fires on the overall P95; it didn’t fire because the large-tenant cohort is a minority of requests and the aggregate looked healthy. The dashboard surfaced the cohort separately, which is how we saw it. Fix: the HNSW index on the embeddings store had grown past the size its parameters were tuned for; a rebuild brought P95 back under target.

This is the dashboard that came closest to missing a regression: without the per-cohort split, and relying only on aggregate alerting, we wouldn’t have caught it until a large-tenant customer complained.

Dashboard 3 — Citation verification pass rate

Every draft the product produces runs through a verification pass: for each cited claim, does the cited source actually support the claim? The pass rate is the fraction of claim-source pairs that survive the check. The target is 97%+. Below 95% is a problem; below 90% is a fire.
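The arithmetic is as simple as it sounds; a sketch, with `supports` standing in for whatever verification check runs on each claim-source pair:

```python
def citation_pass_rate(pairs, supports) -> float:
    """pairs: iterable of (claim, source) tuples.
    supports(claim, source) -> bool is the per-pair verification check."""
    pairs = list(pairs)
    passed = sum(1 for claim, source in pairs if supports(claim, source))
    return passed / len(pairs) if pairs else 1.0


def status(rate: float) -> str:
    # Bands from the text: 97%+ is the target, below 95% is a problem,
    # below 90% is a fire.
    if rate < 0.90:
        return "fire"
    if rate < 0.95:
        return "problem"
    if rate < 0.97:
        return "below target"
    return "ok"
```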

What it caught this quarter: a slow drift down from 97.2% to 96.4% over six weeks. No single-day jump, just a slow slide. Investigation showed the drift was tenant-specific — two tenants with heavy DDQ usage had added content blocks faster than the review workflow could process them, and the retriever was pulling in less-vetted blocks. Fix: block-freshness weighting in retrieval scoring, which has since been the subject of a longer post.
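The weighting itself is covered in that longer post; as a rough illustration of the shape of the idea (the half-life, the floor, and the notion of "days since review" are made-up stand-ins here, not our production formula):

```python
import math


def freshness_weighted_score(similarity: float,
                             days_since_review: float,
                             half_life_days: float = 30.0,
                             floor: float = 0.5) -> float:
    """Down-weight blocks that haven't been through review recently:
    a freshly reviewed block keeps its full similarity score, while a
    stale or unreviewed one decays toward floor * similarity."""
    decay = math.exp(-math.log(2) * days_since_review / half_life_days)
    return similarity * (floor + (1.0 - floor) * decay)
```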

See hallucination monitoring in prod for the detection methodology.

Dashboard 4 — Cost-per-response by provider

This is the budget dashboard. Cost per generated response, split by model provider (Anthropic, OpenAI, Gemini). The cost-per-response breakdown post explains the math.
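The breakdown post has the full math; the headline number on this dashboard is essentially the following, with per-call costs already computed upstream (field names are placeholders):

```python
from collections import defaultdict


def cost_per_response(calls: list[dict]) -> dict[str, float]:
    """calls: [{"provider": str, "response_id": str, "cost_usd": float}, ...]
    A response may involve several model calls; they all roll up into
    the cost of that one response."""
    spend = defaultdict(float)
    responses = defaultdict(set)
    for c in calls:
        spend[c["provider"]] += c["cost_usd"]
        responses[c["provider"]].add(c["response_id"])
    return {p: spend[p] / len(responses[p]) for p in spend}
```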

What it caught this quarter: a 14% month-over-month increase in Anthropic costs that wasn’t matched by an increase in response volume. Investigation showed a prompt-caching regression — a change to the system prompt had broken a cache-hit pattern that was saving us on repeat queries. The fix was a two-line revert.

What the four dashboards share

All four are designed to surface trends, not just threshold breaches. Thresholds are the job of the alerting pipeline. The morning review is for patterns too slow or too specific for threshold logic:

  • Recall sliding 3 points over five days.
  • P95 spiking only for a tenant-size cohort.
  • Pass rate drifting down gradually.
  • Cost climbing without a traffic explanation.

Each of these would take longer to catch — or would be missed entirely — with threshold alerting alone.
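We catch these by eye, but the first pattern (a small, sustained slide) is easy enough to codify if you wanted a safety net behind the human; the window and drop size below are illustrative:

```python
def sustained_slide(daily_values: list[float],
                    window: int = 5,
                    min_total_drop: float = 0.03) -> bool:
    """True if the metric declined on each of the last `window` days and
    the cumulative drop over that window reaches `min_total_drop`.
    Flags e.g. recall sliding 3 points over five days even though no
    single day breaches the 5-point alert threshold."""
    if len(daily_values) < window + 1:
        return False
    recent = daily_values[-(window + 1):]
    declining = all(later <= earlier
                    for earlier, later in zip(recent, recent[1:]))
    return declining and (recent[0] - recent[-1]) >= min_total_drop
```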

What’s not on the dashboard

Things we tried and dropped:

  • Embedding drift. We briefly monitored whether the distribution of embedding vectors was shifting over time. It was. The shift was small and didn’t correlate with any downstream metric that mattered. We removed the dashboard.
  • Queue depth. The ingest queue can back up without affecting user-facing latency. We have an alert on sustained depth but don’t need a dashboard.
  • Individual model-call error rates. The aggregate is meaningful; the per-call rate is noise. Dropped.

The four remaining dashboards each catch a distinct class of regression. Anything that would add a fifth needs to show it catches something the first four can’t.

The ritual

Nine in the morning. Fifteen minutes. One person (we rotate). A one-line note in the engineering channel if any dashboard looks off; a deeper investigation if one actually is. The discipline is the short-and-every-day part, not the dashboards themselves — dashboards without someone looking at them are wall art.

Sources

  1. Our hallucination monitoring in prod post
  2. Our retrieval latency budget post
  3. Our cost-per-response breakdown post