Motivation
Most production incidents in a RAG system are not crashes; they are silent degradations: answers get vaguer, a delete “succeeds” but the doc still shows up, the graph never populates. This page catalogues the failure modes we have actually hit, the signal that distinguishes each, and the fix.
Health checks
Start every investigation here.
- GET /healthz: a liveness probe. Returns ok (HTTP 200) when the app boots. Wire it into your load balancer.
- Admin dashboard health panel: backed by App\Services\Admin\HealthCheckService::run(), it returns a per-subsystem status map: db_ok, pgvector_ok, queue_ok, kb_disk_ok, embedding_provider_ok, chat_provider_ok (each ok | degraded | down), a pii_redactor block (whose status is ok | degraded | down | disabled; disabled is intentional, not an error), and a checked_at timestamp. The dashboard endpoint (DashboardMetricsController::health()) additionally attaches a pii_redactor_config snapshot for the PII strip.
degraded/down here localises the fault before you read a single log line.
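As a rough illustration of reading that map, the sketch below filters a decoded health payload down to the subsystems that need attention. The helper name faultySubsystems() and the sample values are hypothetical, and the exact JSON shape is an assumption based on the keys listed above.

```php
<?php
declare(strict_types=1);

/**
 * Return the subsystems that are not healthy, keyed by name.
 * Hypothetical helper for illustration; not part of HealthCheckService.
 *
 * @param array<string, mixed> $health decoded health map
 * @return array<string, string> subsystem => status
 */
function faultySubsystems(array $health): array
{
    $faults = [];
    foreach ($health as $key => $value) {
        if ($key === 'checked_at' || $key === 'pii_redactor_config') {
            continue; // metadata, not a subsystem status
        }
        // pii_redactor is a block with a status field; the rest are plain strings.
        $status = is_array($value) ? (string) ($value['status'] ?? 'unknown') : (string) $value;
        // "disabled" is an intentional state for the redactor, not a fault.
        if ($status === 'ok' || ($key === 'pii_redactor' && $status === 'disabled')) {
            continue;
        }
        $faults[$key] = $status;
    }
    return $faults;
}

// Sample payload with made-up values.
$sample = [
    'db_ok' => 'ok',
    'pgvector_ok' => 'ok',
    'queue_ok' => 'degraded',
    'kb_disk_ok' => 'ok',
    'embedding_provider_ok' => 'down',
    'chat_provider_ok' => 'ok',
    'pii_redactor' => ['status' => 'disabled'],
    'checked_at' => '2024-01-01T00:00:00Z',
];

print_r(faultySubsystems($sample));
```

For the sample above, only queue_ok (degraded) and embedding_provider_ok (down) are reported; the disabled redactor is deliberately ignored.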
Embedding-dimension gotcha
Symptom: after switching the embeddings provider/model, ingestion crashes on a vector-width mismatch, or retrieval quality collapses because old and new vectors coexist.
Cause: KB_EMBEDDINGS_DIMENSIONS and the vector(N) columns on
knowledge_chunks + embedding_cache are a contract. Changing the model’s
output width (for example OpenAI 1536 → Gemini 768 → Regolo 4096) without migrating breaks
it.
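One way to picture the contract is as a pre-flight check that compares three numbers: the configured dimension, the width of each vector(N) column, and the length of a vector the provider actually returns. The sketch below is a hypothetical helper, not something the application ships; how you obtain the column widths and the probe vector is left out and assumed.

```php
<?php
declare(strict_types=1);

/**
 * List every way the dimension contract is broken; empty means it holds.
 * Hypothetical helper; names are illustrative only.
 *
 * @param int                $configured   value of KB_EMBEDDINGS_DIMENSIONS
 * @param array<string, int> $columnWidths table name => N of its vector(N) column
 * @param list<float>        $probe        one embedding freshly returned by the provider
 * @return list<string>
 */
function dimensionProblems(int $configured, array $columnWidths, array $probe): array
{
    $problems = [];
    $actual = count($probe);
    if ($actual !== $configured) {
        $problems[] = "provider returns {$actual} dims, config says {$configured}";
    }
    foreach ($columnWidths as $table => $width) {
        if ($width !== $configured) {
            $problems[] = "{$table}.embedding is vector({$width}), config says {$configured}";
        }
    }
    return $problems;
}

// Example: config and provider already moved to 768, columns still at 1536.
$problems = dimensionProblems(
    768,
    ['knowledge_chunks' => 1536, 'embedding_cache' => 1536],
    array_fill(0, 768, 0.0)
);

foreach ($problems as $problem) {
    echo $problem, PHP_EOL;
}
```

In this example both tables are flagged, which is the state you are in when the config has been changed but the columns have not been migrated.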
Fix — all four steps, in order:
Resize the vector columns
Create a migration that resizes the embedding vector(N) column on
knowledge_chunks and embedding_cache to the new N.
A delete “succeeds” but the document still appears
Symptom: kb:delete (or the API) reports success, yet search still returns
the doc — or reports “not found” for a doc you can see.
Cause: path divergence. The ingest and delete flows must produce the
identical normalised source_path. Every path is funnelled through
App\Support\KbPath::normalize() (collapses //, converts \ to /, rejects . and .. segments).
A hand-built path that skips it will not match the stored row.
Fix: delete by the same path the ingest used. For force-deleting a
previously soft-deleted row, the write path must query withTrashed() — a
default-scoped query hides trashed rows and the delete becomes a silent no-op.
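To make the path divergence concrete, here is a minimal stand-in for the normalisation rules described above. It is not the real App\Support\KbPath::normalize(); the function name normalizeKbPath() is hypothetical, and trimming leading and trailing slashes is an assumption the source does not state.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative stand-in for the behaviour described above.
 * NOT the real App\Support\KbPath::normalize().
 */
function normalizeKbPath(string $path): string
{
    // Convert Windows separators, then collapse runs of slashes.
    $path = str_replace('\\', '/', $path);
    $path = preg_replace('#/+#', '/', $path) ?? $path;
    // Assumption: leading and trailing slashes are trimmed.
    $path = trim($path, '/');

    // Reject relative segments instead of trying to resolve them.
    foreach (explode('/', $path) as $segment) {
        if ($segment === '.' || $segment === '..') {
            throw new InvalidArgumentException("Relative segment not allowed: {$path}");
        }
    }
    return $path;
}

// Two spellings of the same document resolve to one stored key.
var_dump(normalizeKbPath('docs//guides\\setup.md') === normalizeKbPath('docs/guides/setup.md')); // bool(true)
```

A delete issued with the raw 'docs//guides\setup.md' string would not match a row stored under 'docs/guides/setup.md', which is the mismatch described in the cause.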
Spurious refusals (“I don’t have enough context”)
Symptom: the assistant refuses on questions you know are covered.
Cause: the refusal gate thresholds are too strict for your corpus, or retrieval genuinely isn’t surfacing the right chunks.
Fix: inspect the chat-log meta (the refusal carries a machine-readable
refusal_reason). Then tune, in config/kb.php:
The graph is empty / expansion does nothing
Symptom: kb_nodes / kb_edges are empty; graph expansion never fires.
Cause: this is by design when a tenant has no canonical docs.
GraphExpander::expand() and RejectedApproachInjector::pick() both degrade to
no-ops — plain hybrid RAG behaviour.
Fix: canonicalise some docs (frontmatter with doc_id + slug + type),
then php artisan kb:rebuild-graph. Confirm KB_GRAPH_EXPANSION_ENABLED=true.
See canonical & promotion and the
architecture overview.
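Before rebuilding the graph it can help to confirm that a document's frontmatter carries all three keys named above. The sketch below assumes the frontmatter has already been parsed into an array and that each value is a non-empty string; the helper name and the example values are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Report which of the canonical frontmatter keys are absent or empty.
 * Hypothetical check; the real canonicalisation rules may be stricter.
 *
 * @param array<string, mixed> $frontmatter already-parsed frontmatter
 * @return list<string>
 */
function missingCanonicalKeys(array $frontmatter): array
{
    $missing = [];
    foreach (['doc_id', 'slug', 'type'] as $key) {
        $value = $frontmatter[$key] ?? null;
        // Assumption: each key must hold a non-empty string.
        if (!is_string($value) || trim($value) === '') {
            $missing[] = $key;
        }
    }
    return $missing;
}

// Made-up frontmatter that only sets a slug.
$missing = missingCanonicalKeys(['slug' => 'cache-strategy']);
echo 'missing: ', implode(', ', $missing), PHP_EOL;
```

A document that reports any missing key here would not count as canonical under the rule above, so it contributes nothing to kb_nodes / kb_edges.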
Ingestion never completes
Symptom: POST /api/kb/ingest returns 202 but documents never become
searchable.
Cause: no queue worker is processing kb-ingest, or QUEUE_CONNECTION=sync
in an environment that expects async.
Fix:
IngestDocumentJob runs with $tries=3 and backoff [10, 30, 60] — a
transient provider error self-heals; a persistent one lands in failed_jobs
(visible in the admin log viewer’s Failed Jobs tab).
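The retry settings above translate into a concrete waiting schedule. The sketch below assumes Laravel's usual behaviour, where the delay after failed attempt n is backoff[n - 1] and the last entry is reused once the list runs out; retryDelays() is a hypothetical helper, not framework code.

```php
<?php
declare(strict_types=1);

/**
 * Delays (in seconds) applied before each retry of a queued job.
 * Assumes delay after attempt n is backoff[n - 1], reusing the last entry
 * when the list is shorter than the number of retries.
 *
 * @param list<int> $backoff
 * @return list<int>
 */
function retryDelays(int $tries, array $backoff): array
{
    $delays = [];
    // A job allowed $tries attempts is retried at most $tries - 1 times.
    for ($attempt = 1; $attempt < $tries; $attempt++) {
        $delays[] = $backoff[$attempt - 1] ?? ($backoff[count($backoff) - 1] ?? 0);
    }
    return $delays;
}

$delays = retryDelays(3, [10, 30, 60]);
echo implode(', ', $delays), ' => ', array_sum($delays), 's of waiting before failed_jobs', PHP_EOL;
```

Under that assumption, three tries give two retries (after 10 s and 30 s), so a persistently failing document shows up in failed_jobs roughly 40 seconds plus execution time after the first attempt, and the 60 in the list is not reached.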
“My setting did nothing”
Cause: a cached config. Production should run config:cache, but that means
live .env edits are ignored until you clear it.
Fix:
Gotchas & operations
- Read the subsystem health map first — it localises DB / pgvector / queue / disk / provider faults before logs.
- Failures should be loud. Endpoints map failure → correct status (404 / 500 / 503), never 200-with-empty-body. If you see a 200 with an empty/null body, that is a bug to report, not expected behaviour.
- The dimension gotcha is the #1 self-host footgun. All four steps, in order, every time you change the embeddings model.
Configuration
The thresholds and retention knobs referenced above.
Scheduler & maintenance
The retention sweeps that keep storage bounded.