Motivation
The schema is the architecture, made durable. Every invariant the system relies on — idempotency, canonical uniqueness, referential integrity of the graph, the immutable audit trail, cross-tenant isolation — is anchored in a column, an index, or a constraint. This page is the reference-grade map of those tables. Where a constraint encodes a decision, the decision is named.Theory & background
Three design rules govern the schema:- Source-of-truth vs projection. Canonical markdown is authoritative; the
tables are a rebuildable projection. The one exception is
kb_canonical_audit— an immutable forensic trail that must survive hard deletes, so it carries no FK toknowledge_documents. - Tenant-scoped uniqueness. Canonical identifiers (
slug,doc_id) and graph UIDs are unique per project, never globally — two tenants/projects may legitimately sharedec-cache-v2. - Contract columns. The
vector(N)width is part of the embeddings contract;version_hashis the idempotency anchor;tenant_idis the isolation key.
Design
Data model / contract
knowledge_documents
Core columns: id, tenant_id (default default, indexed), project_key,
source_type, title, source_path, mime_type, language (default it),
access_scope (default internal), status (default active),
document_hash (SHA-256 of bytes), version_hash (SHA-256 of source content),
metadata JSON, source_updated_at, indexed_at, timestamps, deleted_at
(soft delete).
Canonical columns (nullable): doc_id, slug, canonical_type,
canonical_status, is_canonical (default false), retrieval_priority
(smallint 0–100, default 50), source_of_truth (default true),
frontmatter_json. Auto-Wiki columns: generation_source (default human),
evidence_tier (nullable).
Uniqueness: (project_key, source_path, version_hash) (the idempotency
anchor), (project_key, doc_id) = uq_kb_doc_doc_id, (project_key, slug) =
uq_kb_doc_slug.
knowledge_chunks
id, knowledge_document_id FK (ON DELETE CASCADE), project_key,
chunk_order (0 = summary), chunk_hash, heading_path, chunk_text,
metadata JSON, embedding vector(N) (text on non-pgsql). Unique:
(knowledge_document_id, chunk_hash) = uq_kb_chunk_doc_hash. Index: GIN on
to_tsvector(<lang>, chunk_text) (pgsql only).
embedding_cache
id, text_hash (SHA-256), provider, model, embedding vector(N),
created_at, last_used_at (LRU). Unique: (text_hash, provider, model).
chat_logs
id, tenant_id, session_id, user_id?, question, answer,
project_key?, ai_provider, ai_model, chunks_count, sources JSON,
prompt_tokens / completion_tokens / total_tokens, latency_ms,
grounding_signal?, grounding_details? JSON, client_ip?, user_agent?,
extra JSON, created_at.
conversations / messages
User-scoped chat history. conversations: tenant_id, user_id FK
(cascade), title?, project_key, with index (user_id, updated_at).
messages: tenant_id, conversation_id FK (cascade), role, content,
metadata JSON (citations + provider/model telemetry + grounding), created_at.
kb_nodes / kb_edges
The typed graph — see the canonical graph page
for the 9 node types, 10 edge types, weights, provenance, and the
project-scoped composite FKs. Uniqueness: (project_key, node_uid) and
(project_key, edge_uid).
kb_canonical_audit
id, tenant_id, project_key, doc_id?, slug?, event_type (promoted |
updated | deprecated | superseded | rejected_injection_used | graph_rebuild),
actor (user id / command / system), before_json, after_json,
metadata_json, created_at. No updated_at (rows are never mutated) and
no FK to knowledge_documents (survives hard deletes for forensics).
Decision rationale (ADR-style)
- Why no FK on the audit table? It must outlive the documents it records. An FK with cascade would erase the very forensic trail compliance depends on.
- Why composite uniques instead of a global slug? Tenant/project sharing of intuitive slugs is a feature, not a collision (ADR 0001, ADR 0002).
- Why
tenant_idon every domain table (except the cache)? Cross-tenant isolation is a GDPR-grade invariant; the cache is the one deliberate exclusion for reuse economics, gated by an architecture test. - Why a separate
version_hashanddocument_hash?document_hashtracks the original bytes;version_hash(source content) is the idempotency anchor. Separating them lets binary sources (PDF/DOCX) re-convert without false version churn.
Worked example
Gotchas & operations
vector(N)is fixed at migrate time. Changing the embeddings model width needs a resize migration + cache flush + re-index (gotcha).- Soft delete is the default read scope. Any write/admin query that must see
trashed rows uses
withTrashed()/onlyTrashed(). - Tests run on SQLite, where
vector(N)becomes JSON text — driver-specific behaviour (the GIN index, pgvector operators) is pgsql-only.
Ingestion pipeline
What writes these tables.
Security & threat model
The tenant_id isolation invariant.