There's a specific kind of irony in being an AI automation agency that does its own business development by hand. You spend your days building systems that eliminate manual work for clients, then close your laptop and spend an hour writing cold emails that get ignored.

We got tired of it. So we built Finn.

Finn is an AI cold caller. He dials small business owners, opens with a curiosity hook, figures out whether they've thought about AI, and books a callback for our founder Bailey if there's genuine interest. No script-reading, no hand-holding — a fully autonomous voice agent making real outbound calls to real people.

This is the story of how we built it, what broke, what worked, and what we'd do differently.

The Architecture: Separate What Thinks from What Acts

Before we get into Vapi specifics, the most important decision we made had nothing to do with Vapi. It was the architecture.

We call it the WAT framework — Workflows, Agents, Tools. The principle behind it comes from a hard truth about AI systems: if each step in a chain is 90% accurate, five chained steps land you at 59% success. The math compounds against you fast. The fix isn't better AI — it's less AI. Let AI handle reasoning. Let deterministic code handle execution.
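The compounding claim is worth seeing as arithmetic:

```python
# Per-step accuracy compounds multiplicatively across a chain:
# five steps at 90% each leaves roughly 59% end-to-end success.
per_step, steps = 0.90, 5
end_to_end = per_step ** steps
print(f"{end_to_end:.0%}")  # → 59%
```

Each additional AI step multiplies in another failure rate, which is why moving execution into deterministic code pays off so quickly.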

Layer 1: Workflows are Markdown SOPs stored in workflows/. Each one defines the objective, required inputs, which tools to call, expected outputs, and how to handle edge cases. Written like you'd brief a new hire. Updated after every campaign.

Layer 2: Agents are the decision-makers. This is where Vapi and the LLM live. The agent reads context, makes judgment calls, routes conversations, and fires tool calls at the right moments. It doesn't do API calls. It doesn't write to spreadsheets. It decides.

Layer 3: Tools are Python scripts in tools/. They do the actual work — API calls, sheet writes, database writes, webhook handling. Consistent, testable, and fast. When a call ends, a Python function writes the outcome to Google Sheets. Finn never touches a spreadsheet directly.

The resulting flow looks like this:

run_campaign.py → Vapi (Finn on the phone) → webhook_server.py → mark_outcome.py → Google Sheets

Everything in that chain except the Vapi assistant is deterministic. Finn decides. Python executes.

How Claude Code + Vapi Work Together

Vapi handles all the telephony infrastructure — call placement, audio streaming, and transcription. Under the hood it uses Twilio for calling and Deepgram for speech-to-text. It hosts the assistant and streams real-time events (tool calls, transcription, end-of-call reports) to a webhook URL you control.

The prompt is the agent's brain. For Finn, that's agents/ai_curiosity_finn/prompt.md — a source-controlled, versioned Markdown file that gets deployed to Vapi via a setup script. Change the prompt, run one command, and the live assistant is updated within seconds. It's treated exactly like code: reviewed before merging, updated only when data supports it, never changed on a whim.
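The deploy step itself can be a single PATCH against Vapi's assistant endpoint. A hedged sketch of what vapi_setup.py might do — the payload shape, env var names, and helper are our assumptions, not copied from the repo:

```python
import json
import os
import urllib.request


def build_deploy_request(assistant_id: str, prompt: str) -> urllib.request.Request:
    """Build the PATCH that swaps the live assistant's system prompt.
    Payload shape is an assumption about Vapi's assistant update API."""
    body = json.dumps(
        {"model": {"messages": [{"role": "system", "content": prompt}]}}
    ).encode()
    return urllib.request.Request(
        f"https://api.vapi.ai/assistant/{assistant_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get('VAPI_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

The setup script would read prompt.md from disk, build this request, and send it — which is what makes "change the prompt, run one command" possible.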

Claude Code was used throughout the build as the development environment — writing the Python tooling, debugging webhook flows, designing the outcome classification logic, and iterating on the prompt structure. It's not running in production; it's how the system gets built and improved.

The key configuration decisions in agent.json:

{
  "model": { "provider": "openai", "model": "gpt-4.1", "temperature": 0.3 },
  "voice": { "provider": "openai", "voiceId": "echo" },
  "transcriber": { "provider": "deepgram", "model": "nova-2-phonecall" },
  "firstMessageMode": "assistant-waits-for-user",
  "maxDurationSeconds": 300,
  "silenceTimeoutSeconds": 10
}
  • Temperature 0.3: Low randomness. Finn needs to be consistent, not creative.
  • nova-2-phonecall: Deepgram's model specifically tuned for phone audio — compressed, noisy, sometimes clipped. This matters.
  • OpenAI "echo" voice: Conversational, not robotic. Enough warmth to not kill the call in the first five seconds.
  • firstMessageMode: assistant-waits-for-user: More on this below. We never changed it.

The local startup flow is a single script:

./start.sh
# → kills any existing webhook server and cloudflared processes
# → starts FastAPI webhook server on port 8000
# → starts cloudflared tunnel to get a public HTTPS URL
# → writes the tunnel URL to .env as WEBHOOK_URL
# → deploys Finn to Vapi via vapi_setup.py

One command. Production-ready in under 30 seconds.

The Two Sales Motions — and Why We Pivoted

We shipped two completely different approaches on the same infrastructure.

Motion 1 — Speed-to-Lead (frozen): Finn opened with "The reason I'm calling like this is we build systems that respond to new leads instantly — and this call is basically what that looks like in practice." The call was the demo. It's a clever wedge, but it assumes the prospect already understands what speed-to-lead automation is and why it matters. Most small business owners don't. Calls stalled at the concept.

Motion 2 — AI Curiosity (active): "Hey — do you want the good news or the bad news?" No pitch. No demo. Just a question. The reveal: bad news is it's a cold call, good news is we're genuinely curious whether AI could help their business. Then a single diagnostic question: "Have you guys looked into using AI in your business yet, or not really?"

Three response paths branch from there:

  • Path A (yes / we use ChatGPT / been thinking about it) — dig into use cases, ask about their workflow
  • Path B (hard no / not our thing) — one brief curiosity statement, then a business question; no third push
  • Path C (a little / sort of) — treat like Path A

The pivot took about two hours. Swap prompt.md, re-run vapi_setup.py, update registry.json to flip the active flag. Same webhook server, same sheet integration, same outcome logic. The infrastructure didn't care which motion we were running.

That separation — between the conversation logic and the execution infrastructure — is what made the pivot cheap.

Eight Engineering Lessons Nobody Writes About

1. firstMessageMode matters more than you think

assistant-waits-for-user means Finn says nothing until the person on the other end speaks first. This is correct. If Finn speaks first and hits an IVR greeting or a gatekeeper mid-sentence, the call sounds broken immediately. Always wait. Let the human open.

2. IVR detection is not optional

We didn't account for it in v1. Finn would reach an automated phone menu and try to have a conversation with it. The fix is a single prompt rule:

"If you hear an automated phone menu — any message that says 'press 1 for', 'para español', 'if you know your party's extension' — call endCall immediately. Do not say anything."

Simple. Works. Add it on day one.

3. Phone number collection needs ritual

Deepgram transcriptions occasionally drop digits, especially when a prospect rattles off a number quickly. Our fix: collect the number in groups with deliberate pauses, then mandatory read-back before logging it.

"Let me just read that back — six oh six... four five four... five one four oh. Does that sound right?"

That process dropped capture errors from around 5% to under 1%. It also sounds natural — it's what a human would do on the phone.
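The read-back line itself can be generated deterministically rather than left to the LLM. A small helper like this (the 3-3-4 grouping and word map are ours, not from the repo) produces exactly the cadence above:

```python
def read_back(number: str) -> str:
    """Format a 10-digit US number into spoken 3-3-4 groups with pauses."""
    words = {"0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
             "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
    digits = [words[d] for d in number if d.isdigit()]
    groups = [digits[:3], digits[3:6], digits[6:]]
    # "..." marks the deliberate pause between groups
    return "... ".join(" ".join(g) for g in groups if g)
```

For example, `read_back("6064545140")` yields `"six oh six... four five four... five one four oh"`.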

4. Name memory is non-negotiable

The fastest way to destroy a call is for Finn to ask for someone's name after they already gave it. We've all experienced this with automated systems. It signals immediately that no one is actually listening.

The fix is an explicit, emphatic prompt rule — "if they gave you their name at any point in this call, do not ask again, under any circumstances" — reinforced with a CORRECT/WRONG example block showing the exact scenario. Belt and suspenders. It works.

5. Sort leads before dialing

Businesses without a website tend to be owner-operated, which means the owner answers the phone. Fewer reviews means a smaller operation, which means less chance of a gatekeeper or IVR. Sorting by those two signals before dialing meaningfully improves the rate at which Finn reaches an actual decision-maker. No machine learning, no external data enrichment. Just a smarter sort order.
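The whole trick is a sort key. A sketch of the idea — the field names (`website`, `review_count`) are illustrative, not the repo's schema:

```python
def sort_leads(leads: list[dict]) -> list[dict]:
    """Dial no-website, low-review businesses first: they are the
    most likely to be answered by the owner directly."""
    return sorted(
        leads,
        key=lambda lead: (bool(lead.get("website")), lead.get("review_count", 0)),
    )
```

`False` sorts before `True`, so no-website leads come first, and within each group the smallest operations lead the queue.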

6. Pre-mark rows as "Dialing" before the call fires

If the webhook fails partway through a call, the call still happened. If the row isn't marked, the next campaign run re-dials the same person. Pre-marking the sheet row as "Dialing" immediately before firing each call prevents double-contacts. The row number is the primary key — idempotent writes, no surprises.
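The invariant is easy to state in code. This sketch uses an in-memory stand-in for the sheet (the real version writes rows via the Sheets API; status labels here are assumptions):

```python
TERMINAL = {"Dialing", "Warm Lead", "Engaged", "Not Interested",
            "Voicemail", "Wrong Number", "Gatekeeper"}


def next_dialable(rows: list[dict]) -> list[int]:
    """Row indexes safe to dial: anything already marked is skipped,
    so a crashed webhook can never cause a re-dial."""
    return [i for i, row in enumerate(rows) if row.get("status", "") not in TERMINAL]


def dial_row(rows: list[dict], i: int, place_call) -> None:
    rows[i]["status"] = "Dialing"  # pre-mark BEFORE the call fires
    place_call(rows[i]["phone"])
```

Because the row index is the key and the status write happens first, re-running the campaign after any failure is safe by construction.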

7. Signal-based classification beats ML for now

After every call, Finn's transcript gets classified into an outcome: warm lead, engaged, not interested, voicemail, wrong number, gatekeeper. Rather than training a classifier, we scan transcripts for pattern lists with hierarchical precedence.

WARM_LEAD_SIGNALS = [
    "you're all set",
    "bailey will give you a call",
    "that sounds interesting",
    "looking forward to it",
]

ENGAGED_SIGNALS = [
    "the biggest thing",
    "honestly",
    "we spend a lot of time",
    "scheduling",
    "follow up",
    "quoting",
    "callbacks",
]

If the transcript contains a warm lead signal, it's a warm lead — regardless of what else is in there. If no warm lead signal but an engaged signal, it's engaged. And so on down the hierarchy.

This trades some accuracy for complete interpretability. When the classification is wrong, you can read the Notes column and immediately see which signal fired. Easy to debug, easy to improve.
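The precedence scan is only a few lines. The signal lists are the ones above; the tier names and fallback label here are ours:

```python
OUTCOME_SIGNALS = [
    ("Warm Lead", ["you're all set", "bailey will give you a call",
                   "that sounds interesting", "looking forward to it"]),
    ("Engaged", ["the biggest thing", "honestly", "we spend a lot of time",
                 "scheduling", "follow up", "quoting", "callbacks"]),
]


def classify(transcript: str) -> str:
    """First matching tier wins — precedence is simply list order."""
    text = transcript.lower()
    for outcome, signals in OUTCOME_SIGNALS:
        if any(signal in text for signal in signals):
            return outcome
    return "No Signal"
```

Extending the hierarchy (voicemail, wrong number, gatekeeper) is a matter of appending tiers, and logging which signal matched gives you the Notes column for free.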

8. Prompts are source code — treat them that way

The prompt is checked into git. It has a changelog. It gets updated based on campaign data, not instinct. Our rule: don't touch the prompt unless you have 50+ calls and the AI curiosity rate is above 12%. Below that threshold, you don't have enough signal to know what's actually working.

When changes are warranted, fetch_campaign_data.py merges call transcripts with campaign logs and surfaces a unified diff suggestion. That suggestion goes through the same review process as any other code change.

What We'd Do Differently

A/B hook rotation from day one. We launched with a single hook for the AI Curiosity motion. That means we can't test variants without rebuilding campaign history. The infrastructure already supports multiple hooks — we just didn't use it early enough.

Twilio Lookup pre-flight from the start. For $0.005 per number, Twilio's Lookup v2 API tells you whether a phone number is a mobile, landline, VoIP, or virtual aggregator number before you dial it. Virtual/aggregator numbers (Google Voice, TextNow, etc.) almost never reach a real decision-maker. Screening them out saves Vapi per-minute costs and keeps call quality high.

Direct calendar booking instead of email notification. Right now, when Finn books a callback, the webhook fires an email with the prospect's name and preferred time. That works, but it requires a manual step to find a slot and send a follow-up. A proper integration that drops a calendar invite with a Meet link would close that loop in a single step.

The Bigger Point

Building Finn taught us more about what voice agents actually need than any tutorial could. The technical pieces — Vapi, Deepgram, webhooks, tool calls — are all well-documented. What isn't documented is the operational layer: how you handle IVRs, why name memory matters, how to sort a lead list, why your prompt is a production artifact that deserves version control.

That layer is where most voice agent projects fail. Not because the AI isn't smart enough — because the surrounding system isn't solid enough to let it work.

We're still running campaigns with Finn. The system improves with every batch of calls. And the irony hasn't been lost on us: the best demo of what we build for clients is the thing we built for ourselves.

If you're thinking about building something like this — or you want someone to build it for you — that's what we do.

Get in touch →