Building Your First Production AI Workflow in Python (Tutorial)
Most AI tutorials show you how to call the OpenAI API and stop there. Real production AI workflows need five additional layers: schema validation, structured retry, confidence routing, audit logging, and reversible actions. This tutorial walks through building all five for a representative workflow — invoice data extraction — using only Python and the libraries that show up in every real production system we ship.
The workflow we'll build
Input: a PDF invoice or its raw text. Output: structured data (supplier, invoice number, dates, line items with VAT, totals) posted to Xero (or any other destination) as a draft, with a full audit trail.
Stack
- Python 3.12
- openai — official OpenAI SDK
- pydantic v2 — schema validation
- tenacity — retry logic
- structlog — structured logging
- SQLAlchemy 2.x — audit log persistence (or just write to a JSON file for prototyping)
Step 1 — Define the schema
Don't skip this. Pydantic schemas are the contract between the LLM and your business logic. If the LLM returns malformed data, Pydantic catches it before it can do damage.
Pseudocode (real code formatting will vary):
class LineItem(BaseModel): description: str; quantity: int; unit_price_pence: int; vat_rate_pct: int; vat_pence: int; line_total_pence: intclass Invoice(BaseModel): supplier_name: str; invoice_number: str; invoice_date: date; due_date: Optional[date]; line_items: list[LineItem]; subtotal_pence: int; vat_total_pence: int; total_pence: int; currency: str = "GBP"
Critical detail: money in pence as int, not pounds as float. Floats and money don't mix; one rounding error and your accounting reconciliation breaks.
Step 2 — Call the LLM with structured output
Use OpenAI's structured output mode (or Anthropic's tool-use, or Mistral's schema mode). The model returns JSON that matches your schema, or it errors. No more "parse the response and pray".
Pseudocode flow:
- Call
openai.chat.completions.parse(...)withresponse_format=Invoice - OpenAI guarantees the response conforms to your schema, or raises
- Validate it through Pydantic anyway (belt and braces)
- Cross-check: sum of
line_items.line_total_pence+vat_total_penceshould equaltotal_pence
The arithmetic check at step 4 is crucial. LLMs are bad at sums. Compute the totals yourself from the line items; if the LLM's reported totals don't match, log a warning and use the computed values.
Step 3 — Wrap in retry logic
Network errors happen. API rate limits happen. Use tenacity to retry with exponential backoff:
- Retry on transient errors (5xx, network timeouts, rate limit)
- Do NOT retry on schema validation failures (it'll fail the same way)
- Max 3 attempts, exponential backoff (2s, 8s, 30s)
- Log every retry attempt with attempt number + reason
Step 4 — Confidence scoring + routing
Not every successful extraction is equally trustworthy. Compute a confidence score from signals you have:
- OCR confidence (if you're running OCR first — AWS Textract returns per-field confidence)
- Schema validation passed without retry (vs needing a retry)
- Arithmetic check passed (totals match)
- Supplier matches a known supplier in your database (yes = higher confidence)
- Invoice number format looks like prior invoices from this supplier (yes = higher)
- Total amount within typical range for this supplier (yes = higher)
Combine into a single confidence score (0-100). Route based on threshold:
- ≥ 85, known supplier: auto-post to Xero as draft (still requires human approval to post final)
- 60-84: add to bookkeeper review queue with confidence score visible
- < 60 OR new supplier OR amount anomaly: flag for partner attention with explicit reason
Step 5 — Audit logging
Every decision the system makes is written to an append-only audit log BEFORE any side effect. This means:
- Input hash (so you can match the audit entry to the source PDF)
- Raw LLM response (helps debug edge cases later)
- Parsed structure
- Confidence score + breakdown
- Routing decision + reason
- Action taken (post / queue / flag)
- Timestamp + trace ID
When a customer asks "why did the system do X?", you can answer authoritatively. When an audit asks for proof of process, you have it. Without an audit log, you're running blind.
Step 6 — Reversible side effects
Post to Xero as DRAFT, never as posted. Send to Slack for human approval before final action. Schedule emails with 5-minute delay so they can be cancelled. Anything that can't be undone needs a human in the loop.
The pattern: suggest, don't execute. The system does 90% of the work; the human approves the final 10%.
Putting it together — the workflow
Conceptual flow:
- Receive PDF (via email webhook, file upload, scheduled poll)
- Run OCR (Textract or Mistral OCR) if needed; extract text
- Call LLM with structured-output Pydantic schema
- Validate output; retry once with stricter prompt if it fails
- Compute totals from line items; warn if LLM totals disagree
- Compute confidence score from all signals
- Write to audit log (BEFORE any side effect)
- Route based on confidence: auto-draft / review queue / flag
- Send action (Xero API, Slack message, email)
- Notify bookkeeper of any review/flag items via Slack
What you've built
A production-grade AI workflow that:
- Doesn't hallucinate dangerous outputs (schema validation catches)
- Doesn't crash on transient errors (retry logic)
- Can be audited (every decision logged)
- Has a human in the loop for risky decisions (confidence routing)
- Can be undone (drafts, not final actions)
This is the layer between "AI that's impressive in a demo" and "AI that runs unattended overnight in a production accounting practice". The libraries are simple; the pattern is rigorous.
Next steps
- Add unit tests with synthetic invoices covering edge cases
- Build an eval suite that runs nightly with a fixed corpus of test invoices
- Add dashboards (Metabase / Looker Studio) showing volume, confidence distribution, error rate
- Wire to your existing monitoring (Sentry for errors, CloudWatch for metrics)
- Document the runbook for when something goes wrong at 2am
For a full real-world case study of this exact pattern in production, see our invoice OCR automation case study with architecture diagram and ROI math.
Or if you'd rather not build it yourself: we'll build it for you, typically £3,500 fixed-price, payback in 3 months.
Got a workflow you want to talk through?
30 minutes, no pitch. We'll tell you honestly what we'd build — or whether automation isn't right yet.