Field notes

KB schema evolution, year one

Four schema migrations we shipped on the knowledge-base blocks table, three we rolled back, and what the schema looks like now. A field report on schema discipline.

The PursuitAgent engineering team · Engineering · 7 min read

A year of building the knowledge-base blocks table has produced four shipped schema migrations and three that got rolled back. The shipped migrations are the more visible ones; the rollbacks taught us more.

This is a field report — what we changed, why we changed it, and what the current schema looks like.

The starting point

The original kb_blocks table from late 2025:

create table kb_blocks (
  id            uuid primary key default gen_random_uuid(),
  company_id    uuid not null references companies(id),
  content       text not null,
  embedding     vector(1536),
  source_doc_id uuid references source_documents(id),
  created_at    timestamptz not null default now(),
  updated_at    timestamptz not null default now()
);

Seven columns. The first hundred blocks fit fine. The first thousand exposed the missing pieces.

Migration 1 — block versioning (shipped)

Three weeks in, the first real customer asked: when an SME edits a block, what happens to the proposals that already cited the old version?

The answer in the original schema was: the proposal silently cites the new content as if it had been there all along. That’s a citation-integrity violation. We shipped block versioning the next sprint:

alter table kb_blocks
  add column version       integer not null default 1,
  add column previous_id   uuid references kb_blocks(id),
  add column is_current    boolean not null default true;

create index kb_blocks_current_idx on kb_blocks (company_id, is_current)
  where is_current = true;

Edits create a new row with version = previous.version + 1, previous_id pointing at the old version, and is_current = true; the old row gets is_current = false. Citations point at the specific row, not at a logical block-id. The content-block-versioning post covers the user-facing implications.
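
A minimal sketch of that edit flow, assuming a single application-level transaction; :old_block_id and :new_content are placeholder bind parameters, not names from our code:

begin;

-- retire the current version
update kb_blocks
   set is_current = false, updated_at = now()
 where id = :old_block_id;

-- insert the replacement row, carrying the lineage forward
-- (embedding left null here; re-embedding is out of scope for this sketch)
insert into kb_blocks (company_id, content, source_doc_id,
                       version, previous_id, is_current)
select company_id, :new_content, source_doc_id,
       version + 1, id, true
  from kb_blocks
 where id = :old_block_id;

commit;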

This migration shipped and stayed. It is the load-bearing change of the year.

Migration 2 — the chunk-id column we rolled back

Around month four, we wanted to track which chunks within a block had been retrieved historically. We added a chunk_ids jsonb column with a list of references to a kb_chunks table.

The kb_chunks table broke kb_blocks apart at retrieval-chunk granularity. The idea was that future analytics could ask “which chunk got retrieved most often” and inform freshness scoring.

We rolled it back six weeks later. Three reasons:

  1. The cardinality blew up. A single block produced 3–8 chunks; the chunks table grew faster than the blocks table by a factor that made our embedding-cost forecasting wrong.
  2. The retrieval pipeline didn’t actually want it. Reranking happens at block granularity; chunks are an indexing detail. We were exposing implementation in the schema, not modeling user intent.
  3. The freshness signal we wanted didn’t need chunks. Freshness was a per-block question; we got the same signal by tracking last-retrieved on the blocks table directly.

The rollback was clean because we had not yet exposed chunk_ids to any application code outside the retrieval module. Lesson: schema changes that expose internal pipeline structure are tempting and almost always wrong. Model the user’s question, not the pipeline’s intermediate state.

Migration 3 — provenance fields (shipped)

Month five. The provenance question — who wrote this block, who approved it, what document did it come from — was being answered by a sprawl of joins across users, source_documents, and a half-built approvals table.

We collapsed the most-asked provenance into kb_blocks directly:

alter table kb_blocks
  add column author_id          uuid references users(id),
  add column approver_id        uuid references users(id),
  add column approved_at        timestamptz,
  add column extraction_source  text,
  add column extraction_run_id  uuid;

author_id is the human who created or last edited the block. approver_id and approved_at are nullable until the block enters an approved state. extraction_source is one of manual, llamaparse, adobe, web-paste. extraction_run_id joins back to the extraction-pipeline run if the block was machine-extracted, for traceability.
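
If you want that vocabulary enforced by the database rather than by application code, a check constraint is one way to do it; a sketch, since the post doesn't say where the enforcement actually lives:

alter table kb_blocks
  add constraint kb_blocks_extraction_source_check
  check (extraction_source in ('manual', 'llamaparse', 'adobe', 'web-paste'));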

Shipped and stayed. The answer provenance graph post covers what we built on top of these columns.

Migration 4 — the tags text[] column we rolled back

Month seven. A taxonomy attempt. We added tags text[] to allow per-block tagging, with a UI that let SMEs assign tags like security, compliance, pricing.

We rolled it back four weeks later. The tags drifted into noise inside a month. SMEs tagged inconsistently (security, Security, sec, security-policy). Search over tags became unreliable. Retrieval improved by zero. The UI got cluttered.

We replaced it with a controlled-vocabulary category_id column referencing a small kb_categories table that company admins can edit. Categories run 12–20 per company, hand-managed. Search treats categories as a faceted filter, not a primary signal. Retrieval is unchanged.
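
A sketch of the replacement shape; the kb_categories columns and the uniqueness constraint are our reconstruction, not the literal migration:

create table kb_categories (
  id          uuid primary key default gen_random_uuid(),
  company_id  uuid not null references companies(id),
  name        text not null,
  unique (company_id, name)
);

alter table kb_blocks
  add column category_id uuid references kb_categories(id);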

Lesson: free-form user-driven taxonomy in a B2B tool is almost never worth the migration. Either the taxonomy is controlled (small, curated, admin-edited) or it isn’t a taxonomy.

Migration 5 — freshness scoring fields (shipped)

Month nine. The freshness alerts feature needed dedicated columns:

alter table kb_blocks
  add column last_retrieved_at  timestamptz,
  add column last_used_in       uuid references proposals(id),
  add column expires_at         timestamptz,
  add column freshness_score    numeric(3, 2);

last_retrieved_at is updated on every retrieval (cheap; we batch the writes). last_used_in is updated when a block is cited in a shipped proposal — a stronger signal than retrieval. expires_at is set manually for blocks with known shelf life (compliance attestations, certifications). freshness_score is computed nightly from those signals.
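
A sketch of the batched last_retrieved_at flush, assuming a staging table of retrieval events; retrieval_log_staging and its columns are illustrative, not from our pipeline:

update kb_blocks b
   set last_retrieved_at = r.max_retrieved_at
  from (select block_id, max(retrieved_at) as max_retrieved_at
          from retrieval_log_staging
         group by block_id) r
 where b.id = r.block_id
   and (b.last_retrieved_at is null
        or b.last_retrieved_at < r.max_retrieved_at);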

Shipped, stayed, currently load-bearing.

Migration 6 — the embedding-version column we rolled back

Month ten. We were about to ship a new embedding model and wanted to track which blocks had been re-embedded. We added embedding_version to the table.

We rolled it back two weeks later. The embedding-version question is more naturally answered at the index level, not the block level. We have a single embedding-pipeline run that owns the index; the run’s metadata records which embedder generated the vectors. Asking each row “what embedder were you generated by” was redundant with that pipeline-level metadata, and it made backfills more expensive because every row had to be touched on every embedding upgrade.

The replacement is a kb_index_runs table that records embedding-model, run timestamp, and a pointer to the index version. Blocks reference the run if they need to; the join is rare.
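
Roughly, with the DDL assumed from that description (the post names the fields, not the exact types):

create table kb_index_runs (
  id               uuid primary key default gen_random_uuid(),
  embedding_model  text not null,
  run_at           timestamptz not null default now(),
  index_version    text not null
);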

Lesson: don’t denormalize information that lives naturally at a higher granularity. The granularity question is the schema question.

Migration 7 — per-block permissions (shipped)

Month eleven. Enterprise customers needed per-block access control: a block containing pricing terms shouldn’t be visible to the security analyst writing the SOC 2 questionnaire. The per-block permissions feature covers the UX.

create table kb_block_permissions (
  block_id   uuid not null references kb_blocks(id),
  role       text not null,
  permission text not null check (permission in ('read', 'write', 'approve')),
  primary key (block_id, role, permission)
);

A separate table, not columns on kb_blocks, because permissions are sparse: 80% of blocks have the default permission set, and we did not want to denormalize that into every row. The retrieval pipeline filters by user role at query time via a join; the join is fast because the table is well-indexed and the role-set is small.
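
A sketch of that query-time filter, assuming the "default permission set" means a block with no explicit read rows is visible to any role; :company_id and :user_role are placeholders, and the real policy may differ:

select b.*
  from kb_blocks b
 where b.company_id = :company_id
   and b.is_current
   and (
     -- default case: no explicit read rows on the block
     not exists (select 1
                   from kb_block_permissions p
                  where p.block_id = b.id
                    and p.permission = 'read')
     -- explicit case: the caller's role has read
     or exists (select 1
                  from kb_block_permissions p
                 where p.block_id = b.id
                   and p.permission = 'read'
                   and p.role = :user_role)
   );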

Shipped and stayed.

What kb_blocks looks like now

Roughly:

create table kb_blocks (
  id                 uuid primary key default gen_random_uuid(),
  company_id         uuid not null references companies(id),
  content            text not null,
  embedding          vector(1536),
  source_doc_id      uuid references source_documents(id),

  -- versioning
  version            integer not null default 1,
  previous_id        uuid references kb_blocks(id),
  is_current         boolean not null default true,

  -- provenance
  author_id          uuid references users(id),
  approver_id        uuid references users(id),
  approved_at        timestamptz,
  extraction_source  text,
  extraction_run_id  uuid,

  -- taxonomy (replaced free-form tags)
  category_id        uuid references kb_categories(id),

  -- freshness
  last_retrieved_at  timestamptz,
  last_used_in       uuid references proposals(id),
  expires_at         timestamptz,
  freshness_score    numeric(3, 2),

  created_at         timestamptz not null default now(),
  updated_at         timestamptz not null default now()
);

Twenty columns. Two reference tables (kb_categories, kb_block_permissions). One pipeline-metadata table (kb_index_runs). The HNSW index on embedding is created out-of-band via SQL after db:push, because Drizzle’s migration generator doesn’t yet emit pgvector index DDL the way we want it. We have a known-issue note for that.
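
The out-of-band index DDL looks roughly like this; the operator class and parameters are assumptions, since all we commit to above is that the index is created by hand after db:push:

create index if not exists kb_blocks_embedding_hnsw_idx
  on kb_blocks
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);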

Three rules that came out of the year

  1. Model the user’s question, not the pipeline’s intermediate state. The chunk-id rollback and the embedding-version rollback both violated this rule.
  2. Free-form taxonomy is debt. If users need to tag, controlled vocabulary or no vocabulary.
  3. Permissions go in their own table. Sparseness, scale, and audit needs all push the right way.

The schema isn’t done. We expect at least two more migrations in the next six months — one for cross-block claim composition (still in research) and one for richer source-document provenance. We will probably ship one we end up rolling back. That’s the rate.

The takeaway: a year of schema work produced four shipped migrations, three rollbacks, and a much healthier mental model for what belongs in kb_blocks and what doesn’t. The rollbacks were not failures. They were the cheapest way to learn what shape the data wants to be.

Sources

  1. Drizzle ORM — Migrations
  2. PostgreSQL — pgvector
  3. PursuitAgent — Content block versioning