March reliability incidents, documented
Two incidents on the platform this month — one degradation, one full outage. What triggered each, how long they ran, what the user impact was, and the specific changes we made after.
Two incidents in March. One was a degradation — draft latency drifted past the 45-second SLA for about 90 minutes. The other was a full outage — drafts failed for 14 minutes on a Wednesday afternoon during peak federal-FY-Q2 load. Both are documented here because we said we would in the engineering log, and because the fixes are instructive.
Incident 1 — March 9, 14:02–15:31 PT. Draft P95 breach.
Impact. Draft P95 rose from 41 seconds to 78 seconds for approximately 89 minutes. 312 drafts affected across 47 tenants. No drafts failed; they completed slowly. No data integrity issues.
Trigger. A background KB reindexing job for a tenant with ~140k blocks ran against a shared embedding worker pool instead of the dedicated reindex pool. The job saturated the inference queue that drafts share.
Detection. Our P95 alert fired at 14:07, five minutes after the breach started. On-call engineer acknowledged at 14:09.
Root cause. A recent refactor consolidated two worker pools into one with a priority lane, with the intent that reindex jobs would run in the low-priority lane. The priority assignment was correct in code but the environment variable that selected the lane wasn’t set on the job’s deployment manifest. The job ran at default priority, which is the same priority as interactive drafts.
Fix. Immediate: drained the running reindex and re-enqueued it with the correct priority flag. P95 recovered within 8 minutes. Durable: added a CI check that fails the deploy if a job declared as kind: reindex doesn’t also set priority: low. That check is now required for all background-work deploys.
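The check itself is small. A minimal sketch in Python of what the CI gate does, assuming a YAML job manifest with top-level kind and priority fields (the field names and layout are illustrative, not our exact manifest schema):

```python
# Hypothetical sketch of the CI gate. The manifest layout and the field
# names "kind" and "priority" are illustrative assumptions, not our schema.
import sys

import yaml  # PyYAML


def check_manifest(path: str) -> bool:
    """A reindex job must be pinned to the low-priority lane."""
    with open(path) as f:
        manifest = yaml.safe_load(f) or {}
    if manifest.get("kind") == "reindex" and manifest.get("priority") != "low":
        print(f"{path}: kind: reindex requires priority: low "
              f"(got {manifest.get('priority')!r})")
        return False
    return True


if __name__ == "__main__":
    results = [check_manifest(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```

Run against every background-work manifest at deploy time, this turns the Incident 1 failure mode into a failed CI job instead of a slow afternoon.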
What we did differently in the postmortem. We had the alert, we had the root cause in the logs, and we had the fix within 30 minutes of detection. The 89-minute total duration was driven by cautious rollback, not by investigation time. Next time, the CI check means the bad manifest never deploys.
Incident 2 — March 25, 15:47–16:01 PT. Draft generation unavailable.
Impact. Full draft generation failure for 14 minutes across all tenants. 58 drafts queued and retried after recovery. 0 drafts lost. User-visible: an error banner in the composer, queue icons in the KB. DDQ answer generation and RFP analysis were unaffected.
Trigger. An upstream model provider returned HTTP 529 (capacity) on every request for 14 minutes. The retry policy on our side was configured to back off exponentially, but the backoff base was 250ms and the cap was 30s. On a 529 cascade, every draft walked its backoff up to the cap, exhausted its retries long before the provider recovered, and failed, returning the 529 to the user.
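To make the arithmetic concrete: with a 250ms base and a 30-second cap, a single draft’s entire retry budget is on the order of a minute, far shorter than a 14-minute provider outage. A rough model (the base and cap are from the incident; the attempt count is an assumption for illustration):

```python
# Rough model of the retry schedule. BASE_S and CAP_S match the incident
# description; the number of attempts is an assumption for illustration.
BASE_S = 0.25   # 250ms backoff base
CAP_S = 30.0    # per-retry delay cap

def backoff_schedule(attempts: int) -> list[float]:
    """Exponential backoff delays, doubling each attempt, capped at 30s."""
    return [min(BASE_S * (2 ** i), CAP_S) for i in range(attempts)]

delays = backoff_schedule(8)
print(delays)       # [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print(sum(delays))  # ~62 seconds of total backoff against a 14-minute outage
```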
Detection. Our synthetic-draft monitor fired at 15:48 (one minute after the cascade started). On-call engineer acknowledged at 15:49.
Root cause. We had two model-provider paths wired. Our primary was saturated. Our fallback was configured but gated behind a circuit breaker that required 50 failures in 60 seconds before flipping. The 529s arrived as one-minute bursts of about 40 failures each, which didn’t trip the breaker. The fallback never engaged.
Fix. Immediate: an operator flipped the circuit breaker by hand at 15:59, and the fallback absorbed the remaining requests. P95 was elevated (~52 seconds) for ~20 minutes after recovery as the fallback warmed up. Durable: the circuit breaker now trips on either (a) 50 failures in 60s or (b) a moving-average failure rate above 35% over any 3-minute window. We added a runbook entry and a button in our internal console to manually flip the breaker without SSHing into a pod.
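The new trip logic reads most clearly as two predicates over a trailing window of request outcomes. A sketch, assuming a simple in-memory event log (the class and bookkeeping here are illustrative, not our production breaker):

```python
# Illustrative sketch of the dual trip condition. The event-log bookkeeping
# and names are assumptions, not our production breaker.
from collections import deque


class TripDecision:
    def __init__(self) -> None:
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp_s, ok)

    def record(self, now: float, ok: bool) -> None:
        self.events.append((now, ok))
        # Keep only the trailing 3 minutes of outcomes.
        while self.events and self.events[0][0] < now - 180:
            self.events.popleft()

    def should_trip(self, now: float) -> bool:
        # (a) the original burst threshold: 50 failures inside 60 seconds
        failures_60s = sum(1 for t, ok in self.events if not ok and t >= now - 60)
        # (b) sustained degradation: >35% failures over the 3-minute window
        failures_3m = sum(1 for _, ok in self.events if not ok)
        rate_3m = failures_3m / len(self.events) if self.events else 0.0
        return failures_60s >= 50 or rate_3m > 0.35
```

Against the Incident 2 pattern (one-minute bursts of roughly 40 failures, with essentially every request failing), condition (a) never fires, but condition (b) does once the window has a few minutes of data.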
What we did differently in the postmortem. We’ve had fallback routing configured for 14 months and hadn’t actually used it in production except in drills. The last drill ran in October. After this incident we added a quarterly live-fire drill: during a scheduled maintenance window, we force-fail the primary and confirm the fallback takes over. The first drill runs April 8, with subsequent drills every 90 days.
What both incidents share
In both cases the trigger was one piece of configuration in the wrong state inside a correctly designed system. Incident 1 was a missing env var. Incident 2 was a threshold too strict for the actual failure pattern. Neither was a novel failure mode. Both were solved by tightening the guard that catches the misconfiguration, not by rewriting the system.
We’re not proud of either incident. We’re publishing them because the alternative — silent postmortems — is worse, and because the teams we admire in this category (and out of it) publish theirs too.
What’s still open
Two questions we’re still working on:
- Cross-tenant noisy-neighbor. Incident 1 was a tenant-initiated job affecting other tenants. Our priority-lane scheme handled it once the manifest was right, but the underlying shared-pool pattern remains a risk. A more isolated model — per-tenant worker pools — costs more and is slower to provision. We’re prototyping it behind a flag.
- Provider fallback warmup. Incident 2 recovered but the fallback took ~20 minutes to fully match primary latency. A pre-warmed fallback is expensive; a cold fallback under load is slow. The right answer is probably a continuously-warmed fallback at 5–10% of primary traffic, which we’ll test this quarter; a routing sketch follows below.
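For the warmup question, the thing we want to test is plain probabilistic routing: send a small, steady slice of real drafts to the fallback so it never goes fully cold. A placeholder sketch (the split ratio and the provider names are assumptions about what we’ll test, not a shipped design):

```python
# Placeholder sketch of the continuously-warmed fallback idea. The split
# ratio and provider names are assumptions, not a shipped design.
import random

FALLBACK_SHARE = 0.05  # somewhere in the 5-10% range; exact value TBD in testing

def pick_provider() -> str:
    """Route a steady trickle of drafts to the fallback to keep it warm."""
    return "fallback" if random.random() < FALLBACK_SHARE else "primary"
```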
The takeaway
Two incidents, 103 minutes of user-visible impact, zero data loss, and two specific changes that keep the same triggers from firing again. That’s the bar we’re holding for reliability during peak federal-FY-Q2 load.