GPT-4 vs Claude vs Open-Source: Picking an LLM for Business Automation

2026-05-19 AI Automation 6 min read Sree Jagatab

"Which LLM should we use?" is the most-asked question in scoping calls. The honest answer is that for 90% of UK business automation work in 2026, the model matters less than the prompt engineering, the retrieval strategy, and the validation layer around it. But there are real differences worth knowing.

GPT-4o (OpenAI) — the default

For most workflows, GPT-4o is the right default. It is strong on structured-output adherence and has a mature API, good tool-use capabilities, a broad ecosystem, and predictable pricing. Where it wins: anything involving structured extraction (invoices, forms, contracts), classification, summarisation at scale, function calling. Where it doesn't: it sometimes “helpfully” expands its answer beyond what was asked, which costs tokens. Cost: roughly $2.50 (USD) per million input tokens and $10 per million output tokens at list price, cheap enough that for 95% of SME workloads the API bill is a rounding error.
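To make the "rounding error" point concrete, here is a minimal sketch of the arithmetic. The per-million rates are the rough figures quoted above, in whatever currency they are billed in; the request volume and token counts are hypothetical, so swap in your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price per one million tokens, in the currency the provider bills in."""
    input_per_million: float
    output_per_million: float


def monthly_cost(rates: Rates, requests: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Estimate a monthly API bill from average tokens per request."""
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return (total_input * rates.input_per_million
            + total_output * rates.output_per_million) / 1_000_000


# Rough GPT-4o figures from the paragraph above.
GPT4O_RATES = Rates(input_per_million=2.50, output_per_million=10.00)

# Hypothetical workload: 10,000 documents a month,
# about 2,000 tokens in and 500 tokens out per document.
print(monthly_cost(GPT4O_RATES, 10_000, 2_000, 500))  # 100.0
```

That is 20 million input tokens and 5 million output tokens a month, costing about 50 each at those rates.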

Claude 3.7 Sonnet (Anthropic) — strong for nuance

Where Claude often beats GPT-4o: long-context work (full contracts, multi-document summaries), nuanced editorial work (copy review, customer comms tone-matching), and refusing to make things up when context is missing. Claude is also notably better at following intricate output schemas without drift. Cost similar to GPT-4o. The reason to default to GPT-4o instead is mostly ecosystem maturity and developer tooling — Claude is catching up fast.

Open-source via Bedrock / Together / self-hosted

Llama 3.1 70B and Mistral Large are genuinely capable for most automation tasks. When to choose open-source: (1) data-residency requirements that the closed providers can't meet, (2) cost optimisation at very high volume (millions of requests/month), (3) need for fine-tuning on proprietary data. When to skip: anything else. The operational cost of self-hosting (GPU instances, scaling, monitoring) almost always exceeds the API cost difference at SME volumes.
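A quick way to sanity-check the volume argument is a break-even calculation. This is a sketch under simplifying assumptions: self-hosting is modelled as one fixed monthly cost plus a marginal cost per request, and every figure in the example is a placeholder, not a quote from any provider.

```python
import math


def break_even_requests(fixed_monthly_hosting: float,
                        api_cost_per_request: float,
                        self_hosted_cost_per_request: float) -> float:
    """Monthly request volume above which self-hosting becomes cheaper.

    Returns math.inf when the API is at least as cheap per request,
    because the fixed hosting cost can then never be recovered.
    """
    saving_per_request = api_cost_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        return math.inf
    return fixed_monthly_hosting / saving_per_request


# Placeholder figures: 2,000 a month for GPU instances and monitoring,
# 0.01 per request via an API, 0.002 per request when self-hosted.
print(round(break_even_requests(2_000, 0.01, 0.002)))
```

With those placeholder numbers the break-even sits at roughly 250,000 requests a month, and that is before counting the engineering time needed to run the infrastructure.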

The decision tree we actually use

  1. Does your industry have data residency requirements the providers can't meet? → open-source via your own AWS/Azure.
  2. Is the workflow long-context (>100k tokens of input)? → Claude.
  3. Is the workflow doing structured extraction or function-calling at scale? → GPT-4o.
  4. Default → GPT-4o, with an evaluation harness in place so you can swap later cheaply.
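The four questions above can be written down as a small function, which is handy when scoping several workflows at once. This is an illustrative sketch: the `Workflow` fields and the returned labels are hypothetical names, and only the 100k-token threshold comes from the list.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 100_000  # tokens of input, from step 2


@dataclass(frozen=True)
class Workflow:
    # True when residency rules rule out the closed providers.
    residency_unmet_by_providers: bool = False
    # Largest input you expect to send in a single request.
    max_input_tokens: int = 0
    # True for structured extraction or function calling at scale.
    structured_extraction_or_function_calling: bool = False


def pick_model(workflow: Workflow) -> str:
    """Walk the decision tree in order; the first matching rule wins."""
    if workflow.residency_unmet_by_providers:
        return "open-source"
    if workflow.max_input_tokens > LONG_CONTEXT_THRESHOLD:
        return "claude"
    if workflow.structured_extraction_or_function_calling:
        return "gpt-4o"
    return "gpt-4o"  # the default, to be revisited via the evaluation harness


print(pick_model(Workflow(max_input_tokens=150_000)))  # claude
```

Order matters: a long-context workflow with an unmet residency requirement still goes to open-source, because that rule is checked first.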

The thing that actually matters

In our experience, the model choice contributes maybe 15% of project outcome quality. The remaining 85% is: retrieval strategy, prompt engineering, schema-validated output, confidence thresholds, audit logging, and the human-review loop for edge cases. Get those right and you can swap models with minimal disruption when something better lands next quarter (which it will).
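As a rough illustration of three of those pieces (schema-validated output, a confidence threshold, and audit logging), here is a standard-library sketch that works the same whichever model produced the JSON. The field names, the 0.85 threshold, and the idea that the model reports its own `confidence` are all assumptions made for the example.

```python
import json
from typing import Any

# Hypothetical schema for an invoice-extraction workflow.
REQUIRED_FIELDS = {"invoice_number": str, "total": float, "confidence": float}
CONFIDENCE_THRESHOLD = 0.85  # assumed value; tune it against real data


def validate(raw: str) -> dict[str, Any]:
    """Parse model output and check it against REQUIRED_FIELDS."""
    data = json.loads(raw)  # raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # Accept whole numbers where a float is expected, but not booleans.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
        data[name] = value
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data


def route(raw: str, audit_log: list[dict[str, Any]]) -> str:
    """Return 'auto_accept' or 'human_review', recording why in audit_log."""
    try:
        record = validate(raw)
    except ValueError as exc:
        audit_log.append({"decision": "human_review", "reason": str(exc)})
        return "human_review"
    confident = record["confidence"] >= CONFIDENCE_THRESHOLD
    decision = "auto_accept" if confident else "human_review"
    audit_log.append({"decision": decision, "confidence": record["confidence"]})
    return decision


log: list[dict[str, Any]] = []
print(route('{"invoice_number": "INV-1", "total": 120, "confidence": 0.95}', log))
```

Because `route` only sees a string of JSON, swapping the model behind it does not touch the validation or review logic.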

See the “How we build” page for the engineering pattern we use across all model choices.

Sree Jagatab
Sree Jagatab is an AI automation engineer based in Wisbech, Cambridgeshire. He builds custom Python and AI automation for UK SMEs across Cambridge, Peterborough, and the surrounding region.
