Most "AI Failures" Are Integration Failures

If you've watched a production AI agent fail in the wild, you've probably watched something that looked like the model's fault. A booking that didn't happen. A caller who got transferred into a void. A customer who was promised a callback that never came. The operator stands over the screen and says, with real frustration, “the AI didn't work.”

Almost every time, it wasn't the AI.

It was the integration. The connection between the model and the system of record. The webhook that returned a 200 but didn't actually write anything. The calendar slot that didn't exist getting offered to a customer anyway. The API field that got silently wiped on the last deploy. The model is fine. The plumbing isn't.

This note collects a few of the integration failures we've watched up close over the last six months — running voice agents for pest control, SMS agents for auto dealerships, DM agents for service businesses. The shapes repeat. If you operate one of these systems, or you're considering one, the patterns here will save you weeks.

01 / The Vapi PATCH Incident
The most expensive bug we've shipped this year

Here's the setup. Your voice agent calls a custom tool — let's say checkAvailability — which hits a function URL on your orchestration layer. The orchestration layer is configured by sending PATCH requests to a tool API. Standard.

The bug: when you send a PATCH with only the fields you want to change, the platform interprets it as a PUT. Every field you didn't include gets wiped. You meant to update the response timeout. You also just deleted the function URL.

The first time this happened to us, the agent went into production on Monday morning at 7:00 AM. The function URL had been overwritten the night before during a deploy. Every customer call — nineteen of them — went into a state where the agent tried to invoke a tool that pointed at nothing. The agent did its best, talking around the missing data. None of the customers booked. Total revenue lost from a single line of careless code: low five figures.

The model behaved exactly as designed. The plumbing did not.

We shipped a fix the same morning. The same bug happened twice more over the following weeks — each time during a routine deploy, each time because an engineer thought PATCH meant partial update. It does, in HTTP. It doesn't, here.

The lesson isn't “don't use Vapi” or “don't use PATCH.” The platform is excellent. The lesson is the meta-lesson:

— Meta-Lesson —

The implicit contract of every external system you depend on is wrong somewhere. The places where it's wrong will look exactly like the places it's right, right up until the moment they take production down. Read the actual behavior, not the documentation.

We now wrap every destructive API call in a helper that fetches the full record, merges our changes on top, and submits the merged object. It is uglier code. It has prevented every integration failure of this class since.
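A minimal sketch of that fetch-merge-submit helper, under stated assumptions: `safe_patch`, `FakeToolApi`, the field names, and the example URL are all hypothetical, and the fake API stands in for any endpoint whose "PATCH" replaces the whole record. It is not the platform's actual client.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def safe_patch(
    fetch: Callable[[], Record],
    submit: Callable[[Record], Record],
    changes: Record,
) -> Record:
    """Fetch the full record, overlay the changes, and submit the merged object."""
    current = fetch()
    merged = {**current, **changes}  # shallow merge: changes win, everything else survives
    result = submit(merged)
    # Guard: fail loudly if a field that existed before the call is gone afterwards.
    lost = [key for key in current if key not in result]
    if lost:
        raise RuntimeError(f"update dropped fields: {lost}")
    return result


class FakeToolApi:
    """Hypothetical in-memory API whose 'PATCH' replaces the record, like a PUT."""

    def __init__(self, record: Record) -> None:
        self.record = dict(record)

    def get(self) -> Record:
        return dict(self.record)

    def patch(self, body: Record) -> Record:
        self.record = dict(body)  # whole-record replacement, the behavior described above
        return dict(self.record)


tool = {"name": "checkAvailability", "url": "https://example.invalid/fn", "timeout_s": 10}

# Naive partial update: the function URL is silently wiped.
naive = FakeToolApi(tool)
naive.patch({"timeout_s": 20})
print("url" in naive.record)

# Wrapped update: the timeout changes and the URL survives.
api = FakeToolApi(tool)
updated = safe_patch(api.get, api.patch, {"timeout_s": 20})
print(updated["url"], updated["timeout_s"])
```

The merge here is shallow, so a record with nested objects would need a deep merge, and the fetch-then-submit pair is not atomic if two deploys run at once.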

02 / Routing Failures Look Like AI Failures
When the agent does its job but no one ever knew

The second pattern is subtler and shows up later. You build the agent. You launch it. The phone rings — or doesn't. Two weeks later the operator says “I don't think the AI is doing anything.”

You pull the logs. The agent is alive. The model is responding. The integrations are clean. The agent is just not getting any traffic.

This happens because somewhere upstream of the agent — usually at the phone provider, sometimes at the website form — a routing rule changed. The agent was “destination C” in a forwarding tree. Someone in the phone system reshuffled the tree and the agent is now destination F, behind a number that nobody dials. The agent is fine. It's just been quietly unplugged.

This is the integration failure that nothing alerts you to. Your monitoring sees heartbeats. Your watchdog sees the agent is healthy. Your error logs are clean — because nothing is happening. Silence is indistinguishable from peace.

The agent is healthy. The integration is broken. The customer never knows your AI exists.

The fix isn't more agent monitoring. The fix is baseline-aware monitoring: if this phone line averaged 14 calls/day for the last 30 days, and today it has 0 calls by 11 AM, that's the alert. Volume regressions need their own dashboard, and the dashboard needs to know what normal looks like before it can tell you something's wrong.

We didn't build this on day one. We built it after the third silent failure.
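The baseline check can be sketched roughly as follows. The function name, the thresholds, and the sample history are illustrative assumptions, not values from a real deployment. The idea is to compare today's count against what the same line usually has by the same clock time.

```python
from statistics import mean
from typing import Sequence


def volume_regressed(
    calls_by_cutoff_history: Sequence[int],
    calls_so_far_today: int,
    min_expected: float = 3.0,
    ratio_floor: float = 0.25,
) -> bool:
    """Return True when today's count is far below the usual count by this time of day.

    calls_by_cutoff_history holds, for each recent day, the calls received
    before the same clock time we are checking at today (for example 11 AM).
    """
    if not calls_by_cutoff_history:
        return False  # no baseline yet, so there is nothing to compare against
    expected = mean(calls_by_cutoff_history)
    if expected < min_expected:
        return False  # the line is normally too quiet by this hour to judge
    return calls_so_far_today < expected * ratio_floor


# Hypothetical line that usually has about 5 calls by 11 AM.
history = [5, 6, 4, 5, 7, 5, 4, 6, 5, 5]
print(volume_regressed(history, 0))  # zero calls by 11 AM: should alert
print(volume_regressed(history, 4))  # a slightly slow morning: should not
```

A production version would likely separate weekdays from weekends and account for holidays. The `min_expected` guard is there so that lines with little traffic do not page someone every quiet morning.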

03 / The Calendar That Lied
Why “available” isn't a property of a slot

The third pattern: the agent offers a 2:00 PM appointment. The customer accepts. The booking goes into the CRM. The technician's calendar shows nothing.

You investigate. The CRM and the calendar are connected. They've been connected for years. The connection was last verified eight months ago.

Somewhere between then and now, the integration that pushes CRM bookings to the field technician's calendar app started silently failing. The CRM thinks the appointment exists. The technician's phone has no idea. The customer shows up at 2:00 PM to an empty house and a missed appointment fee.

This wasn't AI. This was a webhook returning a 401 quietly for six weeks. But the operator's experience is: “the AI agent booked something that didn't actually happen.”

The fix is two layers:

The agent must check, not assume. Before offering a slot, the agent verifies the slot is open on the calendar of record — not the cached system, not the upstream CRM. The actual calendar the technician looks at.
Every booking writes a verification ping. After the agent writes the booking, an independent check reads it back from the calendar five minutes later and pages a human if the read doesn't match the write.

This is more code. It is the difference between a system that books appointments and a system that guarantees appointments.
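The second layer, the delayed read-back, might look something like this sketch. The booking fields, `verify_booking`, and the dictionary standing in for the technician's calendar are hypothetical. A real system would more likely schedule the check as a background job than sleep in-process.

```python
import time
from typing import Callable, List, Optional


def verify_booking(
    read_from_calendar: Callable[[str], Optional[dict]],
    booking: dict,
    page_human: Callable[[str], None],
    delay_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Read the booking back from the calendar of record after a delay; page on mismatch."""
    sleep(delay_s)
    seen = read_from_calendar(booking["id"])
    if seen is None:
        page_human(f"booking {booking['id']} missing from calendar")
        return False
    mismatched = [k for k in ("start", "technician") if seen.get(k) != booking.get(k)]
    if mismatched:
        page_human(f"booking {booking['id']} differs on {mismatched}")
        return False
    return True


booking = {"id": "b-1", "start": "14:00", "technician": "t-7"}
pages: List[str] = []

# Calendar is empty: the CRM accepted the booking but the sync never delivered it.
broken_calendar: dict = {}
ok_broken = verify_booking(broken_calendar.get, booking, pages.append, sleep=lambda s: None)

# Calendar contains the booking: the read matches the write.
healthy_calendar = {"b-1": dict(booking)}
ok_healthy = verify_booking(healthy_calendar.get, booking, pages.append, sleep=lambda s: None)

print(ok_broken, ok_healthy, pages)
```

The point of the design is that the check reads from the system the technician actually looks at, so a webhook that fails quietly shows up within minutes instead of weeks.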

04 / The Two Latencies
Token latency vs. orchestration latency

One more pattern, less catastrophic but pervasive: operators complain the agent is slow. The model is fast — you can see that in the dashboard. The agent still feels slow. Why?

There are two latencies. Token latency — how long the model takes to produce the next token. Orchestration latency — everything else. Fetching the customer record. Looking up availability. Posting to the CRM. Calling the third-party booking tool. Waiting for that tool to return. Waiting for the tool's downstream API to return. Each of those steps takes 150–800 milliseconds. The customer hears silence while they all run.

The model returns in 400 milliseconds. The customer waits four seconds. The four seconds aren't the model.

Almost every “slow agent” complaint we trace ends up being an orchestration latency problem, not a model problem. Cache aggressively. Parallelize tool calls. Speak first, then look things up. The agent should say “Let me check on that for you” out loud while the lookup runs in the background. That filler buys you about 800 milliseconds and turns the pause from awkward into conversational.
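As a rough illustration of "speak first, then look things up", the sketch below starts the filler line and two lookups at the same time using Python's `asyncio.gather`. The three coroutines are hypothetical stand-ins that only sleep, and the delays are made up to fall inside the range mentioned above.

```python
import asyncio
import time
from typing import List, Tuple


async def speak(text: str) -> None:
    await asyncio.sleep(0.2)  # hypothetical stand-in for text-to-speech playback


async def fetch_customer() -> dict:
    await asyncio.sleep(0.3)  # hypothetical stand-in for a CRM lookup
    return {"name": "Sam"}


async def fetch_availability() -> List[str]:
    await asyncio.sleep(0.5)  # hypothetical stand-in for a calendar lookup
    return ["2:00 PM"]


async def handle_turn() -> Tuple[dict, List[str], float]:
    start = time.perf_counter()
    # Start the filler and both lookups together; the wait is the slowest step, not the sum.
    _, customer, slots = await asyncio.gather(
        speak("Let me check on that for you."),
        fetch_customer(),
        fetch_availability(),
    )
    return customer, slots, time.perf_counter() - start


customer, slots, elapsed = asyncio.run(handle_turn())
print(f"{len(slots)} slot(s) for {customer['name']} after {elapsed:.2f}s")
```

Run one after another, these three steps would take about 1.0 second. Run together they take roughly as long as the slowest one, about 0.5 seconds, and the caller hears speech for part of that time instead of silence.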

05 / What This Means If You're Buying AI

If you're an operator evaluating an AI agent for your business, three questions matter more than any model benchmark:

How is the agent monitored for silent integration failures? Heartbeats aren't enough. Ask specifically about volume regressions and write-verification.
What happens when an external API returns success but didn't actually do anything? Every production integration has at least one of these. The vendor's answer should be specific.
Who owns the integrations after launch? Most failures we've seen happen at month three, not week one. The vendor needs to be on the hook for the long tail, not just the launch.

If the answers are vague, the agent will work fine for a month and then start producing the kind of failures the operator will blame on “the AI.”

06 / What This Means If You're Building It

If you're shipping these systems, the meta-pattern is: treat the integrations, not the model, as the product. The model is a commodity. The reliability of the plumbing around the model is what operators are actually buying, even when they don't know it.

Specifically:

Wrap every destructive API call. Never trust documented PATCH semantics.
Read what you wrote, on a delay, from the system the human actually uses.
Build baseline-aware volume monitoring, not just health monitoring.
Pre-fetch and parallelize. Cover the remaining user-facing latency with spoken filler.
When a customer complaint comes in, suspect the integration first, not the model. In our experience, you will be right about 90% of the time.

None of this is glamorous. Almost none of it is in a model paper. But this is what separates an AI demo from an AI system. The demo runs once, in a controlled environment, and gets a 10-second clip on a slide. The system has to run 14,000 times in conditions you didn't predict, against APIs that change without telling you, on top of an operator's calendar that someone else's intern wrote in 2019.

The model is the easy part now. Building production AI is, mostly, building the connective tissue around it. This is the work.

Most “AI failures” are integration failures.