GPT-4 vs Claude vs Open-Source: Picking an LLM for Business Automation
"Which LLM should we use?" is the most-asked question in scoping calls. The honest answer is that for 90% of UK business automation work in 2026, the model matters less than the prompt engineering, the retrieval strategy, and the validation layer around it. But there are real differences worth knowing.
GPT-4o (OpenAI) — the default
For most workflows, GPT-4o is the right default: it is strong on structured-output adherence and has a mature API, good tool-use capabilities, a broad ecosystem, and predictable pricing. Where it wins: anything involving structured extraction (invoices, forms, contracts), classification, summarisation at scale, or function calling. Where it doesn't: it sometimes “helpfully” expands its answer beyond what was asked, which costs output tokens. Cost: roughly $2.50 per million input tokens and $10 per million output tokens at OpenAI's USD list prices, cheap enough that for 95% of SME workloads the API bill is a rounding error.
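To make the "rounding error" claim concrete, here is a minimal back-of-envelope sketch. The per-million-token prices are the figures quoted above; the workload (5,000 documents a month, 3,000 input tokens and 500 output tokens each) is a hypothetical example, not a measurement from a real project.

```python
def monthly_api_cost(requests_per_month, input_tokens, output_tokens,
                     input_price_per_m=2.50, output_price_per_m=10.00):
    """Estimate a monthly API bill, in the same currency as the per-million prices."""
    input_cost = requests_per_month * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests_per_month * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 5,000 invoices a month,
# 3,000 input tokens and 500 output tokens per invoice.
cost = monthly_api_cost(5_000, 3_000, 500)
print(f"{cost:.2f}")  # prints 62.50
```

At those assumed volumes the bill is in the tens per month, which is small next to the staff time the workflow replaces. Substitute your own token counts before relying on the figure.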
Claude 3.7 Sonnet (Anthropic) — strong for nuance
Where Claude often beats GPT-4o: long-context work (full contracts, multi-document summaries), nuanced editorial work (copy review, tone-matching for customer comms), and refusing to make things up when context is missing. Claude is also notably better at following intricate output schemas without drift. Cost is similar to GPT-4o. The reason to default to GPT-4o instead is mostly ecosystem maturity and developer tooling, though Claude is catching up fast.
Open-source via Bedrock / Together / self-hosted
Llama 3.1 70B and Mistral Large are genuinely capable for most automation tasks. When to choose open-source: (1) data-residency requirements that the closed providers can't meet, (2) cost optimisation at very high volume (millions of requests/month), or (3) a need to fine-tune on proprietary data. When to skip: anything else. The operational cost of self-hosting (GPU instances, scaling, monitoring) almost always exceeds the API cost difference at SME volumes.
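The self-hosting trade-off can be sketched as a simple break-even calculation. Every number below is a hypothetical placeholder (a fixed monthly infrastructure cost and two per-request costs), not a quote from any provider; the point is only the shape of the arithmetic.

```python
def break_even_requests(fixed_monthly_cost, api_cost_per_request,
                        self_hosted_cost_per_request=0.0):
    """Monthly request volume above which self-hosting becomes cheaper than the API.

    Returns None when self-hosting never pays off, i.e. when its marginal
    cost per request is not lower than the API's.
    """
    saving_per_request = api_cost_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        return None
    return fixed_monthly_cost / saving_per_request


# Hypothetical figures: 4,000/month for GPU instances and monitoring,
# 0.0125 per request via the API, 0.0025 marginal cost per request self-hosted.
volume = break_even_requests(4_000, 0.0125, 0.0025)
print(f"{volume:,.0f} requests/month")  # roughly 400,000
```

Under these assumptions, self-hosting only starts to pay for itself in the hundreds of thousands of requests a month, and that is before counting engineering time.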
The decision tree we actually use
- Does your industry have data residency requirements the providers can't meet? → open-source via your own AWS/Azure.
- Is the workflow long-context (>100k tokens of input)? → Claude.
- Is the workflow doing structured extraction or function-calling at scale? → GPT-4o.
- Default → GPT-4o, with an evaluation harness in place so you can swap later cheaply.
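The decision tree above can be written as a short function, checked top to bottom in the same order. This is an illustrative sketch only: the `Workflow` fields and the returned labels are names we made up for the example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    # Hypothetical fields describing a workflow during scoping.
    residency_unmet_by_providers: bool = False
    max_input_tokens: int = 0
    structured_extraction_at_scale: bool = False


def pick_model(workflow: Workflow) -> str:
    """Apply the decision tree in order; the first matching rule wins."""
    if workflow.residency_unmet_by_providers:
        return "open-source on your own AWS/Azure"
    if workflow.max_input_tokens > 100_000:
        return "Claude"
    if workflow.structured_extraction_at_scale:
        return "GPT-4o"
    # Default: GPT-4o, paired with an evaluation harness so it can be swapped later.
    return "GPT-4o"


print(pick_model(Workflow(max_input_tokens=150_000)))  # prints Claude
```

The order of the checks matters: a long-context workflow with an unmet residency requirement still lands on open-source, because residency is tested first.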
The thing that actually matters
In our experience, the model choice contributes maybe 15% of project outcome quality. The remaining 85% is: retrieval strategy, prompt engineering, schema-validated output, confidence thresholds, audit logging, and the human-review loop for edge cases. Get those right and you can swap models with minimal disruption when something better lands next quarter (which it will).
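A minimal sketch of what that surrounding layer can look like: validate the model's output against a schema, apply a confidence threshold, log the decision, and route anything doubtful to a human. The invoice fields, the 0.85 threshold, and the assumption that the model reports a `confidence` value in its JSON are all hypothetical choices for illustration. Nothing here is tied to a specific provider, which is what keeps a model swap cheap.

```python
import json
import logging
from typing import Optional, Tuple

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

# Hypothetical schema for an invoice-extraction workflow.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "total": (int, float),
    "confidence": (int, float),
}
CONFIDENCE_THRESHOLD = 0.85  # hypothetical; tune per workflow


def route(raw_output: str) -> Tuple[str, Optional[dict]]:
    """Return ("accept" | "review" | "reject", parsed_output_or_None)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        audit_log.info("reject: output is not valid JSON")
        return "reject", None

    if not isinstance(data, dict):
        audit_log.info("reject: output is not a JSON object")
        return "reject", None

    for field, expected_type in REQUIRED_FIELDS.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            audit_log.info("reject: field %r missing or wrong type", field)
            return "reject", None

    if data["confidence"] < CONFIDENCE_THRESHOLD:
        audit_log.info("review: confidence %.2f below threshold", data["confidence"])
        return "review", data  # goes to the human-review queue

    audit_log.info("accept: invoice %s", data["invoice_number"])
    return "accept", data


decision, parsed = route('{"invoice_number": "INV-1", "total": 120.5, "confidence": 0.4}')
print(decision)  # prints review
```

Because `route` only sees a string, the same checks run unchanged whichever model produced the output.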
See the "How we build" page for the engineering pattern we use across all model choices.
Got a workflow you want to talk through?
30 minutes, no pitch. We'll tell you honestly what we'd build — or whether automation isn't right yet.