How to Catch AI Failures
Show Notes
In this episode, host Hermes Frangoudis interviews Faraz Siddiqi, co-founder and CTO of Bluejay, about testing and monitoring voice AI agents. Faraz explains how Bluejay uses "digital humans" (synthetic customers with accents, background noise, and even emotions) to run hundreds of parallel simulations. Testing 200 menu items used to take 10–12 hours of manual calling; it now takes about five minutes, so you can grab a coffee while Bluejay finds failures before real users hit them. He shares why latency over 1.5 seconds breaks trust, how a transcription error turned "Dr. Pham" into "Dr. Fan," and why founders should listen to every call manually before automating. A must-watch for anyone building production-grade voice agents.
Key Topics Covered
- Manual testing takes 10–12 hours; Bluejay cuts it to 5 minutes.
- Latency above 1.5 seconds breaks user trust.
- Voice will become the primary interface for most products.
- Digital humans can get angry, interrupt, and simulate 3G network drops.
- Don't use an eval platform at 10 calls a week. Listen to every call yourself first.
- One-click replay turns any production failure into a permanent test case.
Episode Chapters & Transcript
Teaser
Faraz previews digital humans that interact with your product and cut manual voice-agent testing from hours to minutes.
Intro
Hermes welcomes Faraz Siddiqi, co-founder and CTO of Bluejay, and asks how the company evolved from an early meetup idea to today.
From restaurant calls to Bluejay's real idea
Faraz recounts walking Third Ave to sell restaurant voice agents and realizing every agent had to handle 150–200 menu items—a problem that pointed toward something bigger.
The 12-hour testing nightmare
Manually testing menu combinations meant 10–12 hours per agent, which led to version zero of Bluejay, an automated caller, and the pivot into Y Combinator.
Digital humans that test your agent like real customers
Hermes asks how Bluejay builds trust in testing; Faraz explains code-coverage-style goals, digital humans with accents and background noise, and parallel simulations that cut testing to five minutes.
Angry customers, bad connections, and fake traffic
Digital humans have phone numbers, emotional inflection, and can get angry—simulating realistic production scenarios before real customers ever call.
Simulating hundreds of calls to break your system
Parallel load tests surface autoscaling gaps and latency spikes, with Slack and email alerts when average conversation latency crosses thresholds like 4,000 milliseconds.
Why 1.5 seconds of lag kills trust
Bluejay tracks utterance and punctuation latency across every simulation; a 1.5-second pause makes users realize they're talking to AI and shorten their intent.
The one deterministic test for unpredictable AI
Tool call adherence verifies your agent invokes the right backend tools with the right inputs—one of the few deterministic ways to test non-deterministic voice agents.
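As a rough illustration of the idea in this chapter (not Bluejay's actual implementation), a tool call adherence check can be sketched as a plain comparison between the tool calls a test scenario expects and the calls the agent actually made. The `ToolCall` type, the `tool_call_adherence` function, and the `book_appointment` tool name are all hypothetical, introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One backend tool invocation: tool name plus its arguments (hypothetical type)."""
    name: str
    args: dict


def tool_call_adherence(expected: list[ToolCall], actual: list[ToolCall]) -> list[str]:
    """Compare calls position by position and return human-readable mismatches.

    An empty list means the agent called the right tools, in the right
    order, with the right inputs.
    """
    mismatches = []
    for i, exp in enumerate(expected):
        if i >= len(actual):
            mismatches.append(f"step {i}: expected {exp.name}, but no call was made")
            continue
        got = actual[i]
        if got.name != exp.name:
            mismatches.append(f"step {i}: expected tool {exp.name}, got {got.name}")
        elif got.args != exp.args:
            mismatches.append(
                f"step {i}: {exp.name} called with {got.args}, expected {exp.args}"
            )
    # Any calls beyond the expected sequence are also reported.
    for j in range(len(expected), len(actual)):
        mismatches.append(f"step {j}: unexpected extra call to {actual[j].name}")
    return mismatches


# Example: the transcription slip from the episode, caught at the tool-call level.
expected_calls = [ToolCall("book_appointment", {"doctor": "Dr. Pham"})]
actual_calls = [ToolCall("book_appointment", {"doctor": "Dr. Fan"})]
for problem in tool_call_adherence(expected_calls, actual_calls):
    print(problem)
```

Because the comparison is exact, the same conversation either passes or fails every time, which is what makes this kind of check deterministic even when the agent's wording varies.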
How to know your agent is about to fail
Heartbeat simulations, production observability, and dashboards catch degradation early—including a transcription bug that booked Dr. Fan instead of Dr. Pham.
One prompt, zero code – AI does the integration
Bluejay's rewritten docs include copy-paste AI integration prompts for every API endpoint so Cursor or Claude Code can wire the platform in one shot.
Docs for PMs, devs, and AI agents
Faraz argues docs should serve three audiences, voice will become the default product interface, and LLMs make stream-of-consciousness input workable.
Testing every 10 minutes or before every deploy
Customers run heartbeat tests as often as every 10 minutes, block CI/CD deploys on simulation pass rates, and use replay to turn production failures into regression cases.
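Blocking a deploy on simulation pass rate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Bluejay's API: the function names and the 95% threshold are hypothetical, and the results are assumed to arrive as a list of pass/fail booleans, one per simulated call.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of simulations that passed; an empty run counts as 0.0."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def should_block_deploy(results: list[bool], min_pass_rate: float = 0.95) -> bool:
    """Block when the pass rate falls below the threshold (assumed value)."""
    return pass_rate(results) < min_pass_rate


def exit_code(results: list[bool], min_pass_rate: float = 0.95) -> int:
    """Map the decision to a process exit code: 0 lets CI continue, 1 fails the job."""
    return 1 if should_block_deploy(results, min_pass_rate) else 0


# Example: 3 of 4 simulations passed, which is below the 95% threshold.
run = [True, True, False, True]
print(f"pass rate: {pass_rate(run):.2f}, exit code: {exit_code(run)}")
```

In a CI step, the script would end with `sys.exit(exit_code(results))`, so a failing simulation run stops the pipeline the same way a failing unit test would. Treating an empty run as a failure is a deliberate choice here: a deploy should not pass just because no simulations ran.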
Don't buy an eval platform yet
At 10 calls a week, Faraz advises listening to every conversation yourself first—automate only once you've learned what you need from each flow.
Turn every production fail into a permanent test
Replay isolates the customer utterances that caused a failure and recreates them as a digital human, growing your test suite from 10 cases to 25 overnight.
GTM teams use Blue Jay to close deals
Sales teams run hundreds of simulations before handoffs and generate PDF test reports as third-party proof that an agent is Bluejay approved.
The day our own alarms almost broke us
Faraz shares a load-test latency scare on their own infrastructure, teases the Blue Jay World rebrand and a skydiving stunt in Blue Jay suits, and Hermes closes the show.