Cost control for RAG: daily budgets, fallback models, burn alerts
How we keep RAG spend predictable per tenant. Daily budgets, model-tier fallbacks, and burn-rate alerts before the bill spikes — with the dashboard and the rules.
A grounded-AI proposal tool runs on model tokens, and model tokens cost money. We’ve written before about cost per response — the unit economics of one bid. This post is about the control plane around that: how we keep daily spend predictable, how we fall back when a tenant exceeds their budget, and how the alerting works before the bill spikes.
The shape of the problem
A typical proposal response runs 30 to 80 questions through the grounded-drafting pipeline. Each question hits hybrid retrieval, a reranker, a drafting model, and a verifier. Token spend per response is a few hundred thousand input tokens (mostly retrieval context) and a few tens of thousands of output tokens. With current pricing, a response costs us $4 to $25 in model spend, depending on length, complexity, and how aggressive the verifier has to be.
The roughly 6x spread between cheap responses and expensive responses is the variability that makes naive spend forecasting useless. A small tenant with one sprawling DDQ in a week can outspend a large tenant with three small RFPs.
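To make the unit economics concrete, here is a back-of-envelope cost estimator. The per-million-token prices are placeholders for illustration, not our actual rates:

```typescript
function estimateResponseCost(inputTokens: number, outputTokens: number): number {
  // Placeholder prices in dollars per million tokens (assumed, not our real rates).
  const INPUT_PER_M = 10;
  const OUTPUT_PER_M = 30;
  return (inputTokens * INPUT_PER_M + outputTokens * OUTPUT_PER_M) / 1_000_000;
}

// A mid-sized response: ~500k input tokens (mostly retrieval context),
// ~50k output tokens across drafting and verification.
estimateResponseCost(500_000, 50_000); // $6.50
```

The input side dominates, which is why retrieval-context size, not answer length, is the main lever on per-response cost.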
The control plane has three layers.
Layer one — daily budgets per tenant
Every tenant has a daily token-spend cap, expressed in dollars rather than tokens to keep it intuitive. Defaults vary by plan tier. The cap is enforced at the API layer:
async function checkBudget(tenantId: string, estimatedCost: number) {
  // Running total of today's model spend for this tenant, in dollars.
  const today = await spendToday(tenantId);
  // Daily cap in dollars, set per plan tier.
  const cap = await tenantCap(tenantId);
  if (today + estimatedCost > cap) {
    throw new BudgetExceeded(tenantId, today, cap);
  }
  return { remaining: cap - today };
}
When the cap is hit, the system does not fail silently. It shows a banner in the proposal builder, emails the tenant admin, and, depending on the tenant’s policy, either pauses generation or falls back to layer two.
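The branching on the tenant's policy can be sketched like this. `BudgetExceeded` mirrors the error thrown by the budget check above; the policy and mode names are illustrative, not our actual API:

```typescript
// Mirrors the error thrown by checkBudget above.
class BudgetExceeded extends Error {
  constructor(public tenantId: string, public spent: number, public cap: number) {
    super(`budget exceeded for ${tenantId}: ${spent} of ${cap}`);
  }
}

// Configured in advance by the tenant admin (illustrative names).
type BudgetPolicy = "hard-pause" | "soft-fallback";
type Mode = "primary" | "fallback" | "paused";

function resolveMode(err: unknown, policy: BudgetPolicy): Mode {
  if (err instanceof BudgetExceeded) {
    // The banner and admin email fire here, then we branch on policy.
    return policy === "soft-fallback" ? "fallback" : "paused";
  }
  throw err; // Anything else is a real failure, not a budget event.
}
```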
Layer two — fallback models
A tenant who hits their budget has a choice configured in advance:
- Hard pause — generation stops. The tenant admin must extend the budget before drafting resumes.
- Soft fallback — generation continues on a smaller, cheaper model tier for the remainder of the day.
The soft-fallback configuration uses a per-task model map. The drafting model can fall back from a frontier-tier model to a cheaper general-purpose model. The reranker can fall back from a cross-encoder to a lighter reranker. The verifier does not fall back — verification quality is non-negotiable; if the budget cannot support verification, generation pauses.
This is a deliberate asymmetry. Drafting at lower quality is recoverable (the reviewer catches the gaps in red team). Verification at lower quality is a hallucination liability. Cost optimization is allowed to compromise drafting; cost optimization is not allowed to compromise grounding.
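The asymmetry can be expressed directly in the per-task model map. Model names here are placeholders; the point is that the verifier's fallback slot is empty by construction:

```typescript
// Illustrative per-task fallback map; model names are placeholders.
const fallbackMap = {
  drafting: { primary: "frontier-large", fallback: "general-mid" },
  rerank:   { primary: "cross-encoder",  fallback: "light-reranker" },
  // Verification never falls back: if the budget can't cover it, pause.
  verify:   { primary: "frontier-large", fallback: null },
} as const;

function modelFor(task: keyof typeof fallbackMap, overBudget: boolean): string {
  const entry = fallbackMap[task];
  if (!overBudget) return entry.primary;
  return entry.fallback ?? "pause";
}
```

Encoding the rule in the config, rather than in branching logic, means there is no code path where a cheaper verifier can be selected by accident.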
Layer three — burn-rate alerts
Daily caps stop runaway spend. Burn-rate alerts catch the trend before the cap is hit.
The alerting rule fires when:
- Today’s spend is more than 50% of the daily cap by 11 AM tenant-local time.
- Today’s spend is on pace to exceed the cap based on the last 60 minutes’ rate.
- Spend in the current hour is more than 3x the trailing 7-day hourly average.
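The three conditions above can be sketched as a single predicate over a spend snapshot. The field names are illustrative, and the snapshot shape is an assumption, not our actual schema:

```typescript
// Illustrative snapshot of a tenant's spend state (assumed shape).
interface BurnSnapshot {
  spendToday: number;          // dollars so far today
  dailyCap: number;            // dollars
  hoursElapsed: number;        // tenant-local hours since midnight
  lastHourRate: number;        // $/hour over the trailing 60 minutes
  hoursRemaining: number;      // hours left in the tenant-local day
  spendThisHour: number;       // dollars in the current hour
  trailing7dHourlyAvg: number; // $/hour averaged over the last 7 days
}

function shouldAlert(s: BurnSnapshot): boolean {
  // More than half the cap gone by 11 AM tenant-local time.
  const earlyBurn = s.hoursElapsed <= 11 && s.spendToday > 0.5 * s.dailyCap;
  // On pace to exceed the cap at the trailing-hour rate.
  const onPace = s.spendToday + s.lastHourRate * s.hoursRemaining > s.dailyCap;
  // Current hour is 3x the trailing 7-day hourly average.
  const spike = s.spendThisHour > 3 * s.trailing7dHourlyAvg;
  return earlyBurn || onPace || spike;
}
```

Any one condition firing is enough; they catch different failure shapes (front-loaded spend, sustained pace, and sudden bursts respectively).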
The alert routes to the tenant admin and to our internal ops channel. The internal route is what catches misconfigurations on our side — a runaway agent, a retry loop, a tenant whose KB pulled in a 2,000-page PDF that’s hammering the embedding service.
We’ve caught two configuration mistakes with the burn-rate alerts in the last two months: a tenant who imported their entire SharePoint without realizing every file was being re-embedded on every save, and an internal test job that started a retrieval-evaluation harness in production by accident.
What the dashboard shows
The tenant admin’s view of the cost control plane is a single dashboard with three panels:
- Today’s spend — running total, percent of daily cap, projected end-of-day spend at current rate.
- Last 30 days — spend by day, segmented by feature (drafting, retrieval, embeddings, verification).
- Active rules — daily cap, fallback policy, alert thresholds, and the most recent alerts (acknowledged or active).
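The projected end-of-day figure in the first panel can be as simple as a linear extrapolation from the current rate. This is a sketch of that naive formula, which may differ from what the real dashboard computes:

```typescript
// Naive linear extrapolation: today's average $/hour carried to a 24-hour day.
function projectEndOfDay(spendSoFar: number, hoursElapsed: number): number {
  if (hoursElapsed <= 0) return spendSoFar;
  return (spendSoFar / hoursElapsed) * 24;
}

// $40 spent by 8 AM projects to $120 by end of day.
projectEndOfDay(40, 8); // 120
```

A real projection would likely weight by each tenant's intraday usage curve, since proposal drafting clusters in business hours, but the linear version is the honest baseline to show on a dashboard.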
The dashboard is the same surface a finance lead at the tenant org uses to decide whether to expand the cap or whether to investigate why drafting got expensive this week. The data is the data — we don’t aggregate it into a “cost score” that hides what’s happening.
What we don’t do
Three things we considered and didn’t ship.
Per-user caps. We don’t enforce limits at the user level inside a tenant. The tenant’s own admin governance is responsible for that. We don’t want to build a quota system that the tenant has to manage; that’s their tool, not ours.
Predictive auto-upgrade. We don’t automatically expand a tenant’s cap when they hit it. The tenant admin decides whether to expand. Auto-expansion creates a class of cost surprise we don’t want to be the source of.
Hidden fallback. When the system falls back to a cheaper model, the tenant sees an explicit indicator on the response. The drafting that happened on the fallback model is not silently equivalent to drafting that happened on the primary model. If quality matters, the tenant should know which model produced which draft.
Why this is non-trivial
A grounded-AI tool that runs on model tokens has a different cost shape than a tool that runs on infrastructure. Infrastructure scales sublinearly with usage; tokens scale linearly. A tool without cost controls runs into either bankruptcy or burst-throttle outages the first time a customer’s usage spikes. The control plane is what keeps the unit economics legible to the customer and predictable to us.
We covered the per-response cost breakdown in the cost-per-response post. The retrieval-eval pillar covers the quality side of the cost/quality trade-off. This post is the operations layer: the rules that make the trade-off enforceable in production.