Database schema

Motivation

The schema is the architecture, made durable. Every invariant the system relies on — idempotency, canonical uniqueness, referential integrity of the graph, the immutable audit trail, cross-tenant isolation — is anchored in a column, an index, or a constraint. This page is the reference-grade map of those tables. Where a constraint encodes a decision, the decision is named.

Theory & background

Three design rules govern the schema:

Source-of-truth vs projection. Canonical markdown is authoritative; the tables are a rebuildable projection. The one exception is kb_canonical_audit — an immutable forensic trail that must survive hard deletes, so it carries no FK to knowledge_documents.
Tenant-scoped uniqueness. Canonical identifiers (slug, doc_id) and graph UIDs are unique per project, never globally — two tenants/projects may legitimately share dec-cache-v2.
Contract columns. The vector(N) width is part of the embeddings contract; version_hash is the idempotency anchor; tenant_id is the isolation key.

Design

Data model / contract

`knowledge_documents`

Core columns: id, tenant_id (default default, indexed), project_key, source_type, title, source_path, mime_type, language (default it), access_scope (default internal), status (default active), document_hash (SHA-256 of bytes), version_hash (SHA-256 of source content), metadata JSON, source_updated_at, indexed_at, timestamps, deleted_at (soft delete). Canonical columns (nullable): doc_id, slug, canonical_type, canonical_status, is_canonical (default false), retrieval_priority (smallint 0–100, default 50), source_of_truth (default true), frontmatter_json. Auto-Wiki columns: generation_source (default human), evidence_tier (nullable). Uniqueness: (project_key, source_path, version_hash) (the idempotency anchor), (project_key, doc_id) = uq_kb_doc_doc_id, (project_key, slug) = uq_kb_doc_slug.

`knowledge_chunks`

id, knowledge_document_id FK (ON DELETE CASCADE), project_key, chunk_order (0 = summary), chunk_hash, heading_path, chunk_text, metadata JSON, embedding vector(N) (text on non-pgsql). Unique: (knowledge_document_id, chunk_hash) = uq_kb_chunk_doc_hash. Index: GIN on to_tsvector(<lang>, chunk_text) (pgsql only).

`embedding_cache`

id, text_hash (SHA-256), provider, model, embedding vector(N), created_at, last_used_at (LRU). Unique: (text_hash, provider, model).

embedding_cache is intentionally NOT tenant-scoped — it is a cross-tenant reuse layer (the same text+provider+model embeds once across all tenants). This exclusion is documented in TenantIdMandatoryTest. See security & threat model.

`chat_logs`

id, tenant_id, session_id, user_id?, question, answer, project_key?, ai_provider, ai_model, chunks_count, sources JSON, prompt_tokens / completion_tokens / total_tokens, latency_ms, grounding_signal?, grounding_details? JSON, client_ip?, user_agent?, extra JSON, created_at.

`conversations` / `messages`

User-scoped chat history. conversations: tenant_id, user_id FK (cascade), title?, project_key, with index (user_id, updated_at). messages: tenant_id, conversation_id FK (cascade), role, content, metadata JSON (citations + provider/model telemetry + grounding), created_at.

`kb_nodes` / `kb_edges`

The typed graph — see the canonical graph page for the 9 node types, 10 edge types, weights, provenance, and the project-scoped composite FKs. Uniqueness: (project_key, node_uid) and (project_key, edge_uid).

`kb_canonical_audit`

id, tenant_id, project_key, doc_id?, slug?, event_type (promoted | updated | deprecated | superseded | rejected_injection_used | graph_rebuild), actor (user id / command / system), before_json, after_json, metadata_json, created_at. No updated_at (rows are never mutated) and no FK to knowledge_documents (survives hard deletes for forensics).

Decision rationale (ADR-style)

Why no FK on the audit table? It must outlive the documents it records. An FK with cascade would erase the very forensic trail compliance depends on.
Why composite uniques instead of a global slug? Tenant/project sharing of intuitive slugs is a feature, not a collision (ADR 0001, ADR 0002).
Why tenant_id on every domain table (except the cache)? Cross-tenant isolation is a GDPR-grade invariant; the cache is the one deliberate exclusion for reuse economics, gated by an architecture test.
Why a separate version_hash and document_hash? document_hash tracks the original bytes; version_hash (source content) is the idempotency anchor. Separating them lets binary sources (PDF/DOCX) re-convert without false version churn.

Worked example

-- the idempotency anchor in action
SELECT id, version_hash, deleted_at
FROM knowledge_documents
WHERE project_key = 'handbook' AND source_path = 'onboarding.md'
ORDER BY id;            -- old (archived) + new versions both present

-- a chunk's vector + FTS coexist
SELECT chunk_order, heading_path, (embedding IS NOT NULL) AS has_vec
FROM knowledge_chunks
WHERE knowledge_document_id = 4213
ORDER BY chunk_order;

Gotchas & operations

vector(N) is fixed at migrate time. Changing the embeddings model width needs a resize migration + cache flush + re-index (gotcha).
Soft delete is the default read scope. Any write/admin query that must see trashed rows uses withTrashed() / onlyTrashed().
Tests run on SQLite, where vector(N) becomes JSON text — driver-specific behaviour (the GIN index, pgvector operators) is pgsql-only.

Ingestion pipeline

What writes these tables.

Security & threat model

The tenant_id isolation invariant.

Get Started

Guides

Best Practices

Integrations

Configuration & Operations

Architecture

Reference

Motivation

Theory & background

Design

Data model / contract

`knowledge_documents`

`knowledge_chunks`

`embedding_cache`

`chat_logs`

`conversations` / `messages`

`kb_nodes` / `kb_edges`

`kb_canonical_audit`

Decision rationale (ADR-style)

Worked example

Gotchas & operations

Ingestion pipeline

Security & threat model

​Motivation

​Theory & background

​Design

​Data model / contract

​knowledge_documents

​knowledge_chunks

​embedding_cache

​chat_logs

​conversations / messages

​kb_nodes / kb_edges

​kb_canonical_audit

​Decision rationale (ADR-style)

​Worked example

​Gotchas & operations

Ingestion pipeline

Security & threat model

Motivation

Theory & background

Design

Data model / contract

`knowledge_documents`

`knowledge_chunks`

`embedding_cache`

`chat_logs`

`conversations` / `messages`

`kb_nodes` / `kb_edges`

`kb_canonical_audit`

Decision rationale (ADR-style)

Worked example

Gotchas & operations