Building Your First Production AI Workflow in Python (Tutorial)

2026-05-19 AI Automation 13 min read Sree Jagatab

Most AI tutorials show you how to call the OpenAI API and stop there. Real production AI workflows need five additional layers: schema validation, structured retry, confidence routing, audit logging, and reversible actions. This tutorial walks through building all five for a representative workflow — invoice data extraction — using only Python and the libraries that show up in every real production system we ship.

The workflow we'll build

Input: a PDF invoice or its raw text. Output: structured data (supplier, invoice number, dates, line items with VAT, totals) posted to Xero (or any other destination) as a draft, with a full audit trail.

Stack

Python 3.12
openai — official OpenAI SDK
pydantic v2 — schema validation
tenacity — retry logic
structlog — structured logging
SQLAlchemy 2.x — audit log persistence (or just write to a JSON file for prototyping)

Step 1 — Define the schema

Don't skip this. Pydantic schemas are the contract between the LLM and your business logic. If the LLM returns malformed data, Pydantic catches it before it can do damage.

Pseudocode (real code formatting will vary):

class LineItem(BaseModel): description: str; quantity: int; unit_price_pence: int; vat_rate_pct: int; vat_pence: int; line_total_pence: int
class Invoice(BaseModel): supplier_name: str; invoice_number: str; invoice_date: date; due_date: Optional[date]; line_items: list[LineItem]; subtotal_pence: int; vat_total_pence: int; total_pence: int; currency: str = "GBP"

Critical detail: money in pence as int, not pounds as float. Floats and money don't mix; one rounding error and your accounting reconciliation breaks.

Step 2 — Call the LLM with structured output

Use OpenAI's structured output mode (or Anthropic's tool-use, or Mistral's schema mode). The model returns JSON that matches your schema, or it errors. No more "parse the response and pray".

Pseudocode flow:

Call openai.chat.completions.parse(...) with response_format=Invoice
OpenAI guarantees the response conforms to your schema, or raises
Validate it through Pydantic anyway (belt and braces)
Cross-check: sum of line_items.line_total_pence + vat_total_pence should equal total_pence

The arithmetic check at step 4 is crucial. LLMs are bad at sums. Compute the totals yourself from the line items; if the LLM's reported totals don't match, log a warning and use the computed values.

Step 3 — Wrap in retry logic

Network errors happen. API rate limits happen. Use tenacity to retry with exponential backoff:

Retry on transient errors (5xx, network timeouts, rate limit)
Do NOT retry on schema validation failures (it'll fail the same way)
Max 3 attempts, exponential backoff (2s, 8s, 30s)
Log every retry attempt with attempt number + reason

Step 4 — Confidence scoring + routing

Not every successful extraction is equally trustworthy. Compute a confidence score from signals you have:

OCR confidence (if you're running OCR first — AWS Textract returns per-field confidence)
Schema validation passed without retry (vs needing a retry)
Arithmetic check passed (totals match)
Supplier matches a known supplier in your database (yes = higher confidence)
Invoice number format looks like prior invoices from this supplier (yes = higher)
Total amount within typical range for this supplier (yes = higher)

Combine into a single confidence score (0-100). Route based on threshold:

≥ 85, known supplier: auto-post to Xero as draft (still requires human approval to post final)
60-84: add to bookkeeper review queue with confidence score visible
< 60 OR new supplier OR amount anomaly: flag for partner attention with explicit reason

Step 5 — Audit logging

Every decision the system makes is written to an append-only audit log BEFORE any side effect. This means:

Input hash (so you can match the audit entry to the source PDF)
Raw LLM response (helps debug edge cases later)
Parsed structure
Confidence score + breakdown
Routing decision + reason
Action taken (post / queue / flag)
Timestamp + trace ID

When a customer asks "why did the system do X?", you can answer authoritatively. When an audit asks for proof of process, you have it. Without an audit log, you're running blind.

Step 6 — Reversible side effects

Post to Xero as DRAFT, never as posted. Send to Slack for human approval before final action. Schedule emails with 5-minute delay so they can be cancelled. Anything that can't be undone needs a human in the loop.

The pattern: suggest, don't execute. The system does 90% of the work; the human approves the final 10%.

Putting it together — the workflow

Conceptual flow:

Receive PDF (via email webhook, file upload, scheduled poll)
Run OCR (Textract or Mistral OCR) if needed; extract text
Call LLM with structured-output Pydantic schema
Validate output; retry once with stricter prompt if it fails
Compute totals from line items; warn if LLM totals disagree
Compute confidence score from all signals
Write to audit log (BEFORE any side effect)
Route based on confidence: auto-draft / review queue / flag
Send action (Xero API, Slack message, email)
Notify bookkeeper of any review/flag items via Slack

What you've built

A production-grade AI workflow that:

Doesn't hallucinate dangerous outputs (schema validation catches)
Doesn't crash on transient errors (retry logic)
Can be audited (every decision logged)
Has a human in the loop for risky decisions (confidence routing)
Can be undone (drafts, not final actions)

This is the layer between "AI that's impressive in a demo" and "AI that runs unattended overnight in a production accounting practice". The libraries are simple; the pattern is rigorous.

Next steps

Add unit tests with synthetic invoices covering edge cases
Build an eval suite that runs nightly with a fixed corpus of test invoices
Add dashboards (Metabase / Looker Studio) showing volume, confidence distribution, error rate
Wire to your existing monitoring (Sentry for errors, CloudWatch for metrics)
Document the runbook for when something goes wrong at 2am

For a full real-world case study of this exact pattern in production, see our invoice OCR automation case study with architecture diagram and ROI math.

Or if you'd rather not build it yourself: we'll build it for you, typically £3,500 fixed-price, payback in 3 months.

Sree Jagatab is an AI automation engineer based in Wisbech, Cambridgeshire. He builds custom Python and AI automation for UK SMEs across Cambridge, Peterborough, and the surrounding region. More about Sree →

Got a workflow you want to talk through?

30 minutes, no pitch. We'll tell you honestly what we'd build — or whether automation isn't right yet.

WhatsApp Sree 07864 880790