How We Engineer Around LLM Hallucination in Production
Every LLM hallucinates sometimes. The "just how AI is" framing is a cop-out — production AI systems can be reliable, but only if you engineer for the failure modes rather than wish them away. These are the five patterns we apply to every project.
1. Schema-validated outputs
Never accept free-form LLM output for anything that triggers a side effect. Every model call returns JSON that parses against a Pydantic (Python) or Zod (TypeScript) schema. If parsing fails, the system retries with a stricter prompt — and if that fails, the request escalates to a human review queue instead of silently failing or making up plausible data.
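This catches the whole category of structural failures: "model hallucinated a field name" or "model returned text when JSON was required". It does not catch a well-formed but wrong value, which is what the later layers are for. It costs roughly 10 lines of code per workflow and, by our rough estimate, prevents about 95% of production hallucination incidents.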
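The parse, retry, escalate loop described above can be sketched as follows. This is a minimal illustration using only the standard library as a stand-in for a Pydantic model. The `Invoice` fields, the `call_model` callable, and the list of prompts are all hypothetical names chosen for the example, not part of any real API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Invoice:
    # Hypothetical schema for illustration only.
    supplier: str
    total: float
    currency: str


def parse_invoice(raw: str) -> Invoice:
    """Parse model output strictly; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"supplier", "total", "currency"}
    if set(data) != expected:
        # Catches hallucinated, misspelled, or missing field names.
        raise ValueError(f"field mismatch: {sorted(set(data) ^ expected)}")
    if not isinstance(data["supplier"], str) or not data["supplier"].strip():
        raise ValueError("supplier must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(data["total"], bool) or not isinstance(data["total"], (int, float)):
        raise ValueError("total must be a number")
    if not isinstance(data["currency"], str) or len(data["currency"]) != 3:
        raise ValueError("currency must be a 3-letter code")
    return Invoice(data["supplier"], float(data["total"]), data["currency"])


def extract_with_retry(
    call_model: Callable[[str], str],
    prompts: list[str],
) -> Optional[Invoice]:
    """Try each prompt in order (normal first, stricter after).

    Returns None when every attempt fails validation, which the caller
    should treat as "send to the human review queue".
    """
    for prompt in prompts:
        try:
            return parse_invoice(call_model(prompt))
        except ValueError:
            continue
    return None
```

Returning `None` rather than a best-effort guess is the point of the design: a failed parse never turns into plausible-looking data further down the pipeline.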
2. Retrieval-Augmented Generation (RAG) for any factual work
If the answer should come from your company's knowledge — docs, policies, historical responses — the LLM should be answering from grounded retrieval, not its training data. The system prompt explicitly states: “Answer only from the provided context. If the context does not support an answer, say so.”
Combined with a citation requirement ("for each claim, state which context chunk supports it"), this catches most of the remaining factual hallucination. A model can still invent a citation, so check in code that every cited chunk was actually retrieved, and reject or escalate any claim that fails the check.
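One way to make the citation requirement checkable in code is sketched below. It assumes, beyond what the text above states, that the model is also asked to return a short verbatim quote from the chunk it cites. The `Claim` shape and the `unsupported_claims` helper are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str      # the statement the model is making
    chunk_id: str  # the context chunk it cites
    quote: str     # verbatim span it says supports the claim (assumed field)


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def unsupported_claims(claims: list[Claim], context: dict[str, str]) -> list[Claim]:
    """Return the claims whose citation does not check out.

    A claim fails when it cites a chunk that was never retrieved, gives an
    empty quote, or gives a quote that does not literally appear in the chunk.
    """
    bad = []
    for claim in claims:
        chunk = context.get(claim.chunk_id)
        if (
            chunk is None
            or not claim.quote.strip()
            or _norm(claim.quote) not in _norm(chunk)
        ):
            bad.append(claim)
    return bad
```

A literal quote match is a cheap filter, not proof that the chunk supports the claim. It removes invented chunk ids and invented quotes, and leaves subtler misreadings for confidence routing and human review.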
3. Confidence routing
Every AI decision in our systems gets a confidence score (often derived from logprobs, sometimes from a self-eval prompt). The score determines the route: high confidence + known pattern → auto-act. Mid confidence → human review queue with the AI's suggestion. Low confidence or unrecognised pattern → flagged with reason, never auto-acted.
The threshold is tunable per workflow. For invoice processing we typically start at a threshold of 85 (on a 0-100 scale). With a well-calibrated threshold, bookkeepers find that about 5% of auto-posted invoices need correction; we tune up or down based on that feedback.
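The routing rule can be sketched as below, assuming a 0-100 confidence scale. The geometric-mean conversion from logprobs is one common choice rather than the only one, and the `review_threshold` default of 50 is a placeholder assumption for the example. Only the auto-act starting point of 85 comes from the text above.

```python
import math
from enum import Enum


class Route(Enum):
    AUTO_ACT = "auto_act"
    HUMAN_REVIEW = "human_review"
    FLAGGED = "flagged"


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability, scaled to 0-100.

    One possible derivation (an assumption, not the only option); an empty
    list is treated as zero confidence.
    """
    if not token_logprobs:
        return 0.0
    return 100.0 * math.exp(sum(token_logprobs) / len(token_logprobs))


def route(
    confidence: float,
    known_pattern: bool,
    auto_threshold: float = 85.0,    # starting point named in the text
    review_threshold: float = 50.0,  # placeholder assumption
) -> tuple[Route, str]:
    """Return the route plus a human-readable reason for the audit trail."""
    if not known_pattern:
        return Route.FLAGGED, "unrecognised pattern"
    if confidence < review_threshold:
        return Route.FLAGGED, f"confidence {confidence:.0f} below {review_threshold:.0f}"
    if confidence >= auto_threshold:
        return Route.AUTO_ACT, "high confidence, known pattern"
    return Route.HUMAN_REVIEW, "mid confidence, needs review"
```

The pattern check runs first on purpose: a very confident answer on an unrecognised pattern is still flagged, never auto-acted.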
4. Append-only audit log of every decision
Before any side effect, the system writes to an audit log: input, intermediate AI output, parsed structure, confidence, routing decision. Append-only, queryable. When a customer says “why did your system do this?”, you can answer authoritatively and in detail. When you're investigating a bad decision, you have the full trail.
This is also the dataset for tuning. Patterns of bad decisions surface in the audit log first; the fix is usually a prompt tweak or threshold adjustment, not a code rewrite.
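A minimal sketch of such a log, using SQLite from the Python standard library. The table layout and function names are assumptions made for the example. The triggers reject UPDATE and DELETE so that ordinary application code can only append.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema; column names mirror the fields listed in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL,
    workflow TEXT NOT NULL,
    input TEXT NOT NULL,
    raw_output TEXT NOT NULL,
    parsed TEXT NOT NULL,
    confidence REAL NOT NULL,
    route TEXT NOT NULL
);
CREATE TRIGGER IF NOT EXISTS audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_decision(
    conn: sqlite3.Connection,
    workflow: str,
    input_text: str,
    raw_output: str,
    parsed: dict,
    confidence: float,
    route: str,
) -> int:
    """Write one decision and commit it; call this BEFORE the side effect."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO audit_log "
            "(ts, workflow, input, raw_output, parsed, confidence, route) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                datetime.now(timezone.utc).isoformat(),
                workflow,
                input_text,
                raw_output,
                json.dumps(parsed, sort_keys=True),
                confidence,
                route,
            ),
        )
    return cur.lastrowid
```

The triggers guard against accidental edits from application code, not against someone with schema access who can drop them. If tamper resistance matters, enforce the append-only rule with storage-level permissions as well.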
5. Reversible side effects where possible
Wherever the workflow allows: post drafts, not finals. Suggest, not execute. Schedule with delay, not immediate. Real-world undo for AI decisions is the difference between “customer noticed before any damage” and “customer noticed after the email went out”.
Xero supports draft bills before posting. Outlook supports delayed delivery for scheduled sends. Slack can carry an approval step before action. Use every one of these where the workflow allows. Even a 5-minute delay between the AI decision and the irrevocable action gives a person or a downstream check the chance to catch the rare bad output that the previous four layers missed.
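Where a platform offers no native draft or delayed-send feature, the same "schedule with delay" idea can be approximated in your own code. The sketch below is a hypothetical in-memory holding queue with a 300-second default delay. The class and method names are invented for this example, and a real deployment would persist the queue rather than keep it in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingAction:
    action_id: str
    run: Callable[[], None]  # the irrevocable side effect, deferred
    due_at: float            # timestamp (seconds) when it may execute


class DelayedActions:
    """Hold side effects for a grace period so they can still be cancelled."""

    def __init__(self, delay_seconds: float = 300.0):
        self.delay_seconds = delay_seconds
        self._pending: dict[str, PendingAction] = {}

    def schedule(self, action_id: str, run: Callable[[], None], now: float) -> None:
        """Queue an action to become runnable delay_seconds after `now`."""
        self._pending[action_id] = PendingAction(
            action_id, run, now + self.delay_seconds
        )

    def cancel(self, action_id: str) -> bool:
        """Return True if the action was still pending and is now cancelled."""
        return self._pending.pop(action_id, None) is not None

    def run_due(self, now: float) -> list[str]:
        """Execute every action whose delay has elapsed; return their ids."""
        due = [a for a in self._pending.values() if a.due_at <= now]
        for action in due:
            # Remove first so a failing action is not retried blindly.
            del self._pending[action.action_id]
            action.run()
        return [a.action_id for a in due]
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the queue easy to test and lets a scheduler drive it with whatever time source it already uses.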
What this looks like in practice
A typical invoice-processing workflow: incoming PDF → OCR + LLM extraction → JSON-validated against schema (layer 1) → embedding lookup of historical supplier data for grounding (layer 2) → confidence routing decides auto-post vs review queue (layer 3) → audit log written before any Xero action (layer 4) → Xero draft, not direct post (layer 5). Five layers, each catching different failure modes.
Result: hallucination becomes a non-event. Errors still happen, but they are caught, logged, and corrected without customer impact. The system is reliable enough to leave running unattended overnight.