GPT-4 Customer Support Chatbot for a Cambridge Service Business
24/7 tier-1 support agent grounded in the company's own docs. End-to-end representative engagement.
- Sector: Cambridge service business (~30 employees)
- Duration: 3 weeks build + 1 week tuning
- Budget band: £4,800 fixed-price build + £80/month run
1. The problem
A Cambridge service business with ~30 employees was fielding ~150 customer support emails per week. Around 70% were repeat questions answerable from the existing FAQ, knowledge base, and Zendesk reply history. The support team was small (3 people), and out-of-hours emails piled up; customer feedback consistently mentioned slow response times. The business had tried generic "ChatGPT for support" widgets, which either hallucinated incorrect answers (and customers noticed) or refused to engage with anything specific to the business.
2. Stack
3. Architecture
System topology — what runs where, what talks to what:
  +-----------------+
  |    Customer     |
  |   on website    |
  +--------+--------+
           |
           v
 +--------------------+      +---------------------+
 | React widget       |      | Vercel Edge         |
 | (12KB)             | ---> | Function (API)      |
 | Conversational UI  |      | Sub-100ms p95       |
 +--------------------+      +----------+----------+
                                        |
                          +-------------+------------+
                          |                          |
                          v                          v
            +---------------------------+  +---------------------+
            | Embed query               |  | Conversation        |
            | text-embedding-3-small    |  | history (Postgres)  |
            | (1536 dims)               |  | Last N turns        |
            +-------------+-------------+  +---------------------+
                          |
                          v
            +---------------------------+
            | pgvector similarity       |
            | search over ~2,000 chunks |
            | (docs, FAQ, past replies) |
            | Top 5 by cosine distance  |
            +-------------+-------------+
                          |
                          v
            +---------------------------+
            | GPT-4o with strict prompt |
            | + retrieved context       |
            | + conversation history    |
            | Output:                   |
            | - answer text             |
            | - confidence score        |
            | - cited chunk IDs         |
            +-------------+-------------+
                          |
        +-----------------+-----------------+
        |                                   |
        v                                   v
+---------------+               +----------------------+
| High conf:    |               | Low conf / explicit  |
| Show answer   |               | handoff requested:   |
| + citations   |               | Create Zendesk       |
| to customer   |               | ticket w/ context    |
+-------+-------+               +-----------+----------+
        |                                   |
        v                                   v
+-------------------------------------------------+
| Audit log: every Q+A, confidence, citations,    |
| customer rating (thumbs up/down)                |
+-------------------------------------------------+
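The retrieval step in the diagram ("Top 5 by cosine distance") can be illustrated without a database. The sketch below is an in-memory stand-in, not the production query: pgvector exposes cosine distance through its `<=>` operator, and here the same quantity is computed by hand. The chunk IDs, the toy 3-dimensional vectors, and the function names are illustrative assumptions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunks, k=5):
    """Return the IDs of the k chunks closest to the query, nearest first."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c["embedding"]))
    return [c["id"] for c in ranked[:k]]

# Toy 3-dimensional vectors (hypothetical); the real embeddings have 1536 dimensions.
chunks = [
    {"id": "faq-12", "embedding": [1.0, 0.0, 0.0]},
    {"id": "docs-3", "embedding": [0.0, 1.0, 0.0]},
    {"id": "reply-88", "embedding": [0.7, 0.7, 0.0]},
]
query = [0.9, 0.1, 0.0]

nearest = top_k_chunks(query, chunks, k=2)
print(nearest)  # ['faq-12', 'reply-88']
```

A smaller distance means a closer match, so the list is sorted in ascending order. In production the sort and limit would be done by the database index rather than in application code.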
4. Automation flow
End-to-end runtime flow — what happens when a real input arrives:
- Indexing (one-off + weekly). Crawled the company website, FAQ, product docs, and 18 months of anonymised Zendesk reply history. Chunked into ~2,000 semantic chunks (~400 tokens each). Embedded with text-embedding-3-small. Stored in pgvector alongside chunk text and source metadata. Weekly re-index for any updated docs.
- Query. Customer types message. Widget streams to Vercel Edge Function. Edge function embeds the query, performs pgvector similarity search, returns top 5 chunks.
- Generation. GPT-4o called with strict system prompt: "Answer only from the provided context. Cite which chunks support each claim. If context insufficient, say so and offer human handoff." Plus retrieved chunks plus conversation history.
- Confidence + routing. Confidence derived from: chunk relevance scores + LLM self-assessment of completeness. High → show answer with citation links. Low or "I don't know" → escalate to Zendesk with structured summary.
- Handoff. On escalation, Zendesk ticket auto-created with: customer details, issue summary, attempted answers, retrieved-but-rejected chunks (useful for support team training), conversation transcript.
- Feedback loop. Customer rates each answer thumbs up/down. Down-rated conversations flow to weekly review queue → support team updates knowledge base or tweaks prompts.
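The confidence and routing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed logic: the equal weighting, the 0.7 threshold, the use of the best chunk similarity as the retrieval signal, and all field names are hypothetical. The source only says that confidence combines chunk relevance scores with the model's self-assessment of completeness.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, to be tuned against rated answers
RELEVANCE_WEIGHT = 0.5      # hypothetical weight given to the retrieval signal

@dataclass
class DraftAnswer:
    text: str
    cited_chunk_ids: list
    self_assessed_completeness: float  # 0..1, reported by the model
    chunk_similarities: list           # cosine similarity of each retrieved chunk, 0..1
    handoff_requested: bool = False

def confidence(draft):
    """Blend the best retrieval similarity with the model's self-assessment."""
    if not draft.chunk_similarities:
        return 0.0
    retrieval = max(draft.chunk_similarities)
    return (RELEVANCE_WEIGHT * retrieval
            + (1 - RELEVANCE_WEIGHT) * draft.self_assessed_completeness)

def route(draft, transcript):
    """Show the answer when confident and cited; otherwise build a handoff payload."""
    score = confidence(draft)
    must_escalate = (
        draft.handoff_requested          # the customer asked for a human
        or not draft.cited_chunk_ids     # an uncited answer is treated as ungrounded
        or score < CONFIDENCE_THRESHOLD
    )
    if must_escalate:
        # Subset of the ticket fields listed above; customer details omitted here.
        ticket = {
            "attempted_answer": draft.text,
            "retrieved_chunk_ids": draft.cited_chunk_ids,
            "transcript": transcript,
        }
        return {"action": "escalate", "confidence": score, "ticket": ticket}
    return {"action": "answer", "confidence": score,
            "text": draft.text, "citations": draft.cited_chunk_ids}

draft = DraftAnswer("Yes, we cover that area.", ["faq-12"], 0.9, [0.9, 0.8])
print(route(draft, transcript=["Do you cover my area?"])["action"])
```

Treating an explicit handoff request as an unconditional escalation keeps the "customer always gets a human if they want one" criterion independent of any score.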
5. What success looked like
- Deflect 50%+ of tier-1 customer enquiries with answers indistinguishable from human support
- Never hallucinate — only answer from grounded knowledge base, escalate cleanly otherwise
- Match existing support team's tone (warm, specific, not corporate)
- Customer always gets a human if they want one — no dark-pattern dead-ends
6. Outcomes
| Metric | Result |
|---|---|
| Tier-1 deflection rate | Settled at 55–65% after first two weeks of tuning |
| Customer satisfaction (CSAT) | Out-of-hours satisfaction up materially; in-hours equivalent to human-only baseline |
| Median response time | <5 seconds (vs ~4 hours during business hours, ~14 hours overnight) |
| Hallucination incidents | 0 confirmed after week 3 (refusal rate ~8% on novel questions) |
| Support team capacity | Freed 12-15 hours/week → reinvested in complex case resolution + customer success |
| AI API cost | ~£75/month at 600 conversations/week (~£0.03 per conversation) |
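The per-conversation figure in the last row follows from the monthly cost and the weekly volume. A quick check, assuming an average month of 52/12 weeks (that conversion is our assumption, not stated in the source):

```python
conversations_per_week = 600
monthly_api_cost_gbp = 75

# Average month = 52 weeks / 12 months (assumed conversion).
conversations_per_month = conversations_per_week * 52 / 12
cost_per_conversation = monthly_api_cost_gbp / conversations_per_month

print(conversations_per_month)          # 2600.0
print(round(cost_per_conversation, 3))  # 0.029
```

That is roughly £0.03 per conversation, consistent with the table.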
7. Speed improvements
- First-byte response time. <300ms p95 via Vercel Edge Functions (vs ~2-4 seconds with naive serverless cold-start)
- Customer answer latency. Streaming response, first word in <2s, full answer typically 4-8s
- Out-of-hours coverage. Was: nothing until next morning. Now: instant. Customer perceived improvement is enormous.
8. SEO & visibility growth
- Site engagement signals. Average session duration +28% (customers stay to chat). Bounce rate -12% on docs pages.
- Schema rich results. Site now eligible for FAQPage rich snippets (we generated FAQ schema from the questions customers most often ask the chatbot).
- Indirect: support docs improved. Weekly knowledge-gap reviews surfaced 30+ FAQ improvements over 3 months → better content → better organic rankings.
9. ROI math
10. Maintenance model
See care plan tiers for full structure.
11. Honest gotchas — what we'd do differently
- First version hallucinated when retrieved chunks were ambiguous — fixed by tightening the system prompt to require explicit citation of which chunk supports each claim. Made the model materially less prone to confabulating.
- Pricing-page queries needed a special-case live-fetch path (pricing changes weekly, no point caching). Custom tool added to the LLM toolbox: get_current_pricing() — calls live API.
- Privacy review: chat transcripts retained 30 days only. No PII in embeddings (we redact emails/phone numbers before embedding). UK/EU pgvector instance.
- Initial latency was bad (~3-5s first byte) before Vercel Edge migration. Moving the API to Edge Functions cut p95 by 4×.
- The customer "did this answer help?" rating widget initially had 60% click-through but dropped to 15% after a month. We removed the persistent prompt and now show it only after a signal that the question was high-stakes. Engagement is back to 35%.
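The redaction step mentioned in the privacy gotcha (emails and phone numbers removed before embedding) could look something like the sketch below. The regular expressions, the placeholder tokens, and the function name are illustrative assumptions; the phone pattern only targets UK-style numbers and would need widening for international formats.

```python
import re

# Hypothetical patterns: a simple email matcher and a UK-style phone matcher
# (leading 0 or +44, then 9-10 more digits with optional spaces or hyphens).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}(?!\w)")

def redact_pii(text):
    """Replace emails and phone numbers with placeholders before embedding."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

sample = "Mail jo@example.com or call 01223 123456."
print(redact_pii(sample))  # Mail [EMAIL] or call [PHONE].
```

Running redaction before the embedding call means the raw identifiers never reach the vector store, which is the property the privacy review relies on.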