Motivation
Ingestion is the front door of the knowledge base. It must be idempotent (re-pushing the same bytes is a no-op), versioned (a new version archives the old so stale chunks never surface), and uniform (every document, however it arrives, gets the same chunking, embedding, and canonical handling).
Two entry points, one execution path
Whatever the source, everything converges on IngestDocumentJob → DocumentIngestor.
Never add a third path that skips this — idempotency, canonical handling, and
graph indexing all live here.
CLI ingestion
kb:ingest-folder flags: --project=, --tenant= (restrict to one tenant),
--disk= (override KB_FILESYSTEM_DISK), --pattern= (comma-separated
extensions), --recursive, --sync (run inline, skip the queue). The {path}
is resolved relative to KB_PATH_PREFIX, because the queued job re-applies
the prefix when reading — any new disk-walking command must honour the prefix or
reject absolute paths.
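The prefix rule can be sketched as follows. This is an illustrative Python sketch, not the project's PHP code: `resolve_kb_path` and its arguments are hypothetical names. It only shows the contract described above, namely that callers pass prefix-relative paths and absolute paths are refused, so a later reader can safely re-apply the prefix.

```python
from pathlib import PurePosixPath


def resolve_kb_path(prefix: str, path: str) -> str:
    """Join a prefix-relative path onto the KB prefix.

    Hypothetical helper: `prefix` stands in for the value of KB_PATH_PREFIX.
    Absolute paths are rejected rather than silently re-rooted.
    """
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        raise ValueError(f"absolute paths are not allowed: {path!r}")
    return str(PurePosixPath(prefix) / candidate)
```

Rejecting absolute input up front is the simpler of the two options the text allows; the alternative is to strip the prefix from the input before storing it.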
HTTP ingestion
The HTTP entry point queues an IngestDocumentJob per document. Binary formats
(application/pdf, .docx) require content to be base64-encoded and sent with a
mime_type; the controller decodes the payload, or responds with 422, before writing.
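A minimal sketch of the decode-or-422 step, in Python for illustration only. The function name, the return shape, and the contents of `BINARY_MIME_TYPES` are assumptions; the real controller may recognise more types and report errors differently.

```python
import base64
import binascii
from typing import Optional, Tuple

# Assumption: an illustrative subset of the MIME types treated as binary.
BINARY_MIME_TYPES = {"application/pdf"}


def decode_content(content: str, mime_type: Optional[str]) -> Tuple[int, Optional[bytes]]:
    """Return (status, payload); status is 422 when binary content is not valid base64."""
    if mime_type not in BINARY_MIME_TYPES:
        # Text formats are passed through unchanged.
        return 200, content.encode("utf-8")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return 200, base64.b64decode(content, validate=True)
    except (binascii.Error, ValueError):
        return 422, None
```

Validating before anything is written keeps a malformed upload from ever reaching the disk or the queue.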
Idempotency (the load-bearing invariant)
DocumentIngestor hashes the markdown and upserts on
(project_key, source_path, version_hash). Consequences:
- Re-pushing identical bytes → no-op.
- New bytes at the same source_path → a new version; the previous version’s chunks are archived so stale content never surfaces in retrieval.
- Do not add firstOrCreate logic that bypasses the hash.
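The consequences above can be sketched with an in-memory stand-in for the documents table. This is illustrative Python, not the DocumentIngestor implementation: the `Store` class, the status strings, and the choice of SHA-256 are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Store:
    """Hypothetical in-memory table keyed on (project_key, source_path, version_hash)."""
    rows: Dict[Tuple[str, str, str], str] = field(default_factory=dict)

    def ingest(self, project_key: str, source_path: str, markdown: str) -> str:
        # Assumption: SHA-256 over the markdown bytes serves as the version hash.
        version_hash = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        key = (project_key, source_path, version_hash)
        if self.rows.get(key) == "active":
            return "noop"  # identical bytes are already the live version
        # Archive whichever version is currently active at this path.
        for other, status in list(self.rows.items()):
            if other[:2] == (project_key, source_path) and status == "active":
                self.rows[other] = "archived"
        self.rows[key] = "active"
        return "created"
```

Because the hash is part of the key, a lookup that ignores it (for example on project and path alone) would treat changed content as already ingested, which is why bypassing the hash is ruled out.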
Chunking and embeddings
MarkdownChunker is an in-house, line-based, fence-aware FSM that splits on
section boundaries (no external markdown library). Each chunk is embedded; the
embedding_cache (keyed by the composite UNIQUE(text_hash, provider, model))
means identical text embeds once and is reused — across tenants, by design.
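Both ideas in this section can be sketched briefly. The code below is an illustrative Python approximation, not MarkdownChunker or the real cache: it treats ATX headings as the only section boundary, recognises only triple-backtick fences, and uses SHA-256 for `text_hash`, all of which are assumptions.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

HEADING = re.compile(r"#{1,6}\s")


def chunk_markdown(text: str) -> List[str]:
    """Split on headings, line by line, ignoring '#' lines inside code fences."""
    chunks: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # state flips on every fence marker
        elif not in_fence and HEADING.match(line) and current:
            chunks.append("\n".join(current))  # close the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


class EmbeddingCache:
    """Hypothetical cache keyed on (text_hash, provider, model)."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._rows: Dict[Tuple[str, str, str], List[float]] = {}
        self.misses = 0

    def get(self, text: str, provider: str, model: str) -> List[float]:
        key = (hashlib.sha256(text.encode("utf-8")).hexdigest(), provider, model)
        if key not in self._rows:
            self.misses += 1  # only a miss calls the embedding provider
            self._rows[key] = self._embed(text)
        return self._rows[key]
```

Note that the key carries no tenant or project component, which is what makes reuse across tenants possible.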
The canonical branch
When the markdown carries valid YAML frontmatter, CanonicalParser validates it
against the 9 types / 6 statuses and populates the 8 canonical columns +
frontmatter_json. Prior versions’ canonical identifiers are vacated first (so
the per-project composite uniques accept the new version), and after commit
CanonicalIndexerJob populates kb_nodes + kb_edges. Invalid frontmatter
degrades gracefully to non-canonical ingestion. See
Canonical & promotion.
Deletion mirrors ingestion
kb:delete / DELETE /api/kb/documents / --prune-orphans / the scheduled
kb:prune-deleted all fan in to a single DocumentDeleter. Soft delete is the
default (KB_SOFT_DELETE_ENABLED=true): rows hide from every read path until
kb:prune-deleted hard-deletes them after the retention window. A hard delete
cascades to the graph.
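The soft-delete lifecycle can be sketched as below. This is illustrative Python with hypothetical names (`Doc`, `soft_delete`, `visible`, `prune_deleted`); the retention period is passed in as a parameter because the source does not state its value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Doc:
    source_path: str
    deleted_at: Optional[datetime] = None  # None means the row is live


def soft_delete(doc: Doc, now: datetime) -> None:
    """Mark the row as deleted without removing it."""
    doc.deleted_at = now


def visible(docs: List[Doc]) -> List[Doc]:
    """What a read path sees: soft-deleted rows are hidden."""
    return [d for d in docs if d.deleted_at is None]


def prune_deleted(docs: List[Doc], now: datetime, retention: timedelta) -> List[Doc]:
    """Drop rows whose soft delete is older than the retention window."""
    return [d for d in docs if d.deleted_at is None or now - d.deleted_at < retention]
```

Routing every entry point through one deleter means the hide-then-prune behaviour only has to be correct in one place.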
Gotchas & operations
- Use App\Support\KbPath::normalize() for every source path: ingest and delete must produce identical paths, or deletes emit spurious “not found”.
- Bulk sweeps must be memory-safe (chunkById / cursor, filters pushed into SQL); never load the whole corpus into PHP.
- --sync is for debugging; production ingestion runs through the queue (php artisan queue:work).
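The memory-safe sweep pattern amounts to keyset pagination: filter in SQL, fetch a bounded batch, and resume from the last seen id. The Python and SQLite sketch below only illustrates that shape; the `documents` table, its columns, and `iter_by_id` are hypothetical, and in the application itself the equivalent is the framework's chunkById or cursor.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_by_id(
    conn: sqlite3.Connection, project_key: str, batch_size: int = 500
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (id, source_path), never holding more than one batch."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, source_path FROM documents "
            "WHERE project_key = ? AND id > ? ORDER BY id LIMIT ?",
            (project_key, last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id in this batch
```

Paging on the primary key rather than on an offset keeps each query cheap and stays correct when rows are deleted during the sweep.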
Canonical & promotion
Turn an ingested draft into a human-vouched canonical artifact.
Chat & retrieval
How ingested chunks become grounded, cited answers.