Field notes

Q1 — what we got wrong in ninety days

Ninety days into the public phase of the company. Five specific things we got wrong, what we changed, and what is still open. Written in the same spirit as the launch post — if I cannot say it on the blog, the discipline is theater.

Bo Bergstrom · 10 min read

Ninety days ago we launched the public phase of PursuitAgent — the marketing site, the blog, the published prices, the Grounded-AI Pledge, the first wave of self-serve customers. The ninety-day mark is one of the few points in a company’s life where you can do an honest retrospective without having to either claim victory prematurely or relitigate decisions that have not had time to play out.

This post is the retrospective. Five things we got wrong. What we changed. What is still open. Same rule as every other post on this blog: if I would not write it here, the discipline is theater.

Wrong one — we under-invested in the in-product onboarding

The launch posture assumed that customers who bought through the self-serve tier would arrive in the product with enough context — from the marketing site, from the blog, from the documentation — to get to value on their own. The published pricing page made the buying decision concrete, and we expected the in-product experience to be intuitive enough that the first week was largely self-driven.

It was not. The first wave of self-serve signups produced a noticeable bump in support tickets, most of which clustered around the same handful of friction points: where to upload the first KB document, what to do when the system refused on a question (the refusal surface was honest but not actionable), how to interpret the citation pointers in the drafted output. The product was working as designed; the onboarding path was not.

What we changed: we shipped an in-product onboarding flow with three guided checkpoints (upload your first document, run your first question, review your first refusal). The refusal surface now includes a structured “what happened, what to do” panel. Documentation got rewritten to lead with the loop the customer is actually running, not with feature reference material.

What is still open: the second-week experience. The first-week onboarding got attention. Customers in week two — past the initial guided flow, into the actual production work — still hit a few friction points the team has not fully addressed. The pattern is recoverable, but it is a real gap in the experience. Working on it.

I covered the cost side of this in the ninety-day pricing post — the support load was higher than expected, and the in-product onboarding investment was the response to that load.

Wrong two — the entailment verifier was too strict on day one

The Grounded-AI Pledge enforcement is built on a three-gate refusal model: retrieval floor, rewrite-only drafting, and entailment verification. We launched with all three gates calibrated conservatively — the verifier in particular was tuned strict, on the principle that a false-acceptance (passing a sentence that the source does not actually support) is worse than a false-rejection (refusing a sentence that the source does support).
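For readers who want the shape of that trade-off in code, here is a minimal sketch. Everything in it is illustrative: the function names, the thresholds, and the crude token-overlap stand-in for the verifier are hypothetical, not the production pipeline.

```python
# Illustrative sketch of the three-gate refusal model described above.
# All names, thresholds, and scoring functions are hypothetical stand-ins.
from dataclasses import dataclass

RETRIEVAL_FLOOR = 0.6        # gate 1: below this, refuse rather than guess
ENTAILMENT_THRESHOLD = 0.9   # gate 3: strict on purpose, preferring false-rejection

@dataclass
class Passage:
    text: str
    score: float  # retrieval relevance, 0..1

def support_score(claim: str, source: str) -> float:
    """Crude stand-in for the entailment verifier: shared-token fraction.
    The real verifier is a model; this exists only to make the sketch run."""
    claim_tokens = set(claim.lower().split())
    return len(claim_tokens & set(source.lower().split())) / max(len(claim_tokens), 1)

def answer(passages: list[Passage], drafted_sentences: list[str]):
    # Gate 1: retrieval floor. Refuse if nothing relevant enough came back.
    if not passages or max(p.score for p in passages) < RETRIEVAL_FLOOR:
        return ("refuse", "no sufficiently relevant source material")

    # Gate 2 (rewrite-only drafting) happens upstream: drafted_sentences are
    # assumed to be rewrites of the retrieved passages, never free generation.

    # Gate 3: entailment verification, sentence by sentence.
    for sentence in drafted_sentences:
        if max(support_score(sentence, p.text) for p in passages) < ENTAILMENT_THRESHOLD:
            # A false-rejection here is the accepted cost of avoiding a false-acceptance.
            return ("refuse", f"source does not support: {sentence!r}")

    return ("answer", drafted_sentences)
```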

The principle is right. The calibration was wrong.

In the first month of customer use, the verifier was rejecting on the order of 30 to 40 percent of drafted sentences. Many of those rejections were correct — the source genuinely did not support the drafted claim, and the system was doing its job. But a meaningful share were over-strict — paraphrases that were entirely supported, but where the verifier’s surface-level token matching missed the support. The result was that customers were spending too much time overriding rejections that should not have been rejections, and trust in the refusal mechanism suffered.

What we changed: we shipped a calibration tool that lets a customer set the verifier’s strictness on a sliding scale, with the default moved one notch toward permissive. We also added a “dispute this rejection” mechanism that captures the customer’s reasoning for an override and feeds it back into our internal evaluation set. Over the next quarter we will use those overrides to retrain the verifier on a corpus that includes legitimate paraphrases the original tuning did not handle.

What is still open: numeric tolerances. We err strict on numerics (a drafted “99.9%” against a source that says “99.94%” fails). The strictness is right when the buyer cares about exact figures. It is over-strict when the draft is rounding to a published threshold. The customer-controlled tolerance setting is a partial fix; the deeper fix is a context-aware tolerance the verifier infers from the question type. Not yet shipped.
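To make the numeric case concrete, here is a sketch of what a customer-controlled relative tolerance could look like. The helper and the default values are assumptions for illustration; the context-aware version would set the tolerance from the question type rather than a fixed knob.

```python
# Hypothetical sketch of a customer-controlled numeric tolerance.
# At relative_tolerance=0.0 (the strict launch behavior), a drafted "99.9%"
# against a source that says "99.94%" fails; a small tolerance lets
# rounding to a published threshold pass.

def numeric_match(drafted: float, source: float, relative_tolerance: float = 0.0) -> bool:
    if source == 0:
        return drafted == source
    return abs(drafted - source) / abs(source) <= relative_tolerance

assert numeric_match(99.9, 99.94) is False                           # strict default rejects the rounding
assert numeric_match(99.9, 99.94, relative_tolerance=0.001) is True  # 0.1% tolerance accepts it
```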

Wrong three — the changelog discipline did not match the engineering velocity

We promised that product changes would land on the relevant /platform/* page first, then on the blog. The pattern works in theory. In practice, the engineering team in Q1 shipped at a pace the marketing-page-update process could not keep up with, and there were two-to-three-week stretches where shipped product features were not yet reflected on the marketing site.

The gap was not large. The features were honest and the marketing pages did not contradict them. But the order — marketing first, blog second — slipped to “shipped first, marketing eventually.” That is the failure mode the launch-post discipline was meant to prevent.

What we changed: we now run a weekly check that reconciles shipped features against the marketing pages. Anything shipped this week that is not yet on the relevant /platform/* page is on a 5-day fix queue. The blog post (when one is warranted) follows the marketing update, not the ship.
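As a sketch of what that reconciliation reduces to, assuming shipped features and /platform/* pages share a feature identifier (the identifiers below are hypothetical, and the sketch makes no assumption about where either list comes from):

```python
# Hypothetical sketch of the weekly reconciliation: shipped feature identifiers
# versus the features currently reflected on the /platform/* pages.

def reconcile(shipped: set[str], documented: set[str]) -> set[str]:
    """Return features that shipped but are not yet on a marketing page."""
    return shipped - documented

# Anything returned here goes on the 5-day fix queue.
undocumented = reconcile(
    shipped={"pair-capture", "refusal-panel", "verifier-calibration"},
    documented={"refusal-panel"},
)
print(sorted(undocumented))  # ['pair-capture', 'verifier-calibration']
```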

What is still open: the manual check is sustainable at the engineering volume we have today, but if velocity increases (which we expect), the reconciliation process will need more automation. We have not yet built that automation. It is on the list.

Wrong four — we under-published the Sarah posts on craft

The original cadence allocated Sarah — the composite proposal-veteran voice — one post every four to five days, targeting roughly 75 posts a year. The actual cadence in Q1 was closer to one every six to seven days. Annualized, that is a shortfall of roughly 15 posts against the target.

This sounds like a process complaint and not a strategic mistake. It is a strategic mistake. The Sarah posts are the part of the blog that earns trust with the practitioner audience — the proposal managers, capture leads, and SMEs who are evaluating whether PursuitAgent is built by people who understand the craft. The engineering posts earn trust with the technical audience; the Bo posts (these) earn trust with the buyer audience. Sarah earns trust with the practitioner audience. Under-publishing Sarah weakened our position with the audience that, in the end, has to use the product day to day.

What we changed: the editorial calendar for Q2 commits to a higher Sarah cadence and pre-drafts the next four Sarah posts so the production pipeline does not depend on real-time writing. The reading-an-RFP series and the recent pillar piece on reading the procurement document are the start of the catch-up.

What is still open: the composite-author disclosure. We were honest from day one that Sarah is a composite voice, and every Sarah post carries a footer saying so. The honesty does the right thing on truth-in-byline. It does not fully solve the question of whether a composite voice can build the same kind of trust as a named individual writer. I do not know the answer. We will keep watching how the audience receives those posts.

Wrong five — we under-priced the smallest tier

The public pricing posture committed to specific numbers for the self-serve tier and the mid-tier. The mid-tier number, as I wrote, was somewhat arbitrary but defensible. The self-serve number — I am willing to say now, after ninety days — was set too low.

The unit economics on self-serve were tight at launch. They are not unsustainable, but the margin is thinner than it needs to be, and the pricing does not reflect the engineering and support investment that goes into making the product work for a small team without a procurement budget. Teams buying on a credit card are not paying less because the engineering is less; they are paying less because the buying motion is faster.

What we changed: nothing yet. I am not raising the price in Q1, because raising the price ninety days in would communicate the wrong thing to the early self-serve customers who took the leap on the launch posture. The price will move sometime in the next six to nine months, and when it does it will be visible on the page and called out in a post. New signups at the new price; existing customers grandfathered for at least a year.

What is still open: the question of what the right number actually is. I do not have it. We are running internal pricing sensitivity analyses (mostly: looking at expansion behavior, churn, and willingness-to-pay signals from the sales conversations on the next tier up). When the data settles on a number, the post explaining the change will explain how we got to it.

What we got right (a paragraph each)

The retrospective format invites a balanced section on the things that worked. Brief, because this post is mostly about the wrongs.

The published prices. Even with the under-priced self-serve tier, the decision to publish prices at all was correct and is doing the work I expected — qualifying inbound traffic, anchoring competitive negotiations, signaling category-level differentiation. The discomfort I described in the pricing-in-public post is real, but the directional bet held.

The Pledge. The Grounded-AI Pledge is a contractual commitment with teeth (customers can terminate without penalty if we breach it), and the Pledge has not been breached. More than that, the Pledge has been a useful internal forcing function — when we considered shipping a feature that would have made the Pledge harder to enforce, the Pledge made the conversation simple. We did not ship it.

The blog discipline. The voice doc and the cadence have held. We have not, in ninety days, run a post that I would describe as filler. Some posts are short and some posts are speculative, but the discipline against vendor-blog sludge has held. The cadence target was ambitious; we have hit it within a small margin.

The engineering build log. The Engineering team’s posts — chunking, the drafting loop, the cost breakdown, the reranker math — have produced exactly the kind of trust-building artifact I hoped they would. Several inbound conversations in Q1 cited specific engineering posts as the reason the prospect took our call seriously.

What we are watching for in Q2

Three things, in priority order.

Customer retention on the self-serve tier. Self-serve was the riskiest piece of the pricing posture. The first cohort is now ninety days in, and the renewal patterns are starting to be informative. If retention drops materially below what we modeled, we have a structural problem with the self-serve experience that the in-product onboarding fix does not fully address. We will know more by mid-August.

Verifier calibration drift. As customer overrides on the entailment verifier come in, the calibration data will grow. The risk is that the data is biased toward the kinds of questions customers are most willing to dispute, which is not the same as the kinds of questions where the verifier is most often wrong. We are designing the calibration process carefully to avoid that bias. We will write about it when we have something to show.

The win/loss intelligence loop. Pair capture shipped this week. Reflection ships in August. Writeback ships in September. Each piece is bounded; the loop only delivers the compounding promise when all three are running. We are betting heavily on this in Q2 and will say so on the blog as it lands.

The shape of Q2

I am going to keep doing the retrospective every quarter. The format is a forcing function — knowing the post is going to be public makes me write down the wrongs as they happen, instead of editing them out in retrospect. Same as the published prices forced a discipline on the pricing math. Same as the Pledge forced a discipline on the engineering refusal logic. Pre-committing to public accountability is, in the end, the most reliable way I have found to actually behave the way I want to behave.

If that posture stops working, I will say so on this blog before the posture changes.

The first ninety days of public PursuitAgent had real wrongs and real rights. The retrospective is an attempt at honesty about both, in the spirit the launch post promised. If we are getting the spirit wrong, send the email. The address has not changed.

Sources

  1. PursuitAgent — Why we're writing this blog
  2. PursuitAgent — The Grounded-AI Pledge in code
  3. PursuitAgent — Pricing in public, ninety days in

See grounded retrieval in the product.

Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.