
How to Catch AI Failures

May 28, 2026

00:44:08


Show Notes

In this episode, host Hermes Frangoudis interviews Faraz Siddiqi, co-founder and CTO of Bluejay, about testing and monitoring voice AI agents. Faraz explains how Bluejay uses "digital humans" (synthetic customers with accents, background noise, and even emotions) to run hundreds of parallel simulations. What used to take 10–12 hours of manual calling to test 200 menu items now takes just five minutes, letting you grab coffee while Bluejay finds failures before real users hit them. He shares why latency over 1.5 seconds breaks trust, how a transcription error turned "Dr. Pham" into "Dr. Fan," and why founders should listen to every call manually before automating. A must-watch for anyone building production-grade voice agents.

Key Topics Covered

•Manual testing takes 10–12 hours; Bluejay cuts it to 5 minutes.
•Latency above 1.5 seconds breaks user trust.
•Voice will become the primary interface for most products.
•Digital humans can get angry, interrupt, and simulate 3G network drops.
•Don't use an eval platform at 10 calls a week. Listen to every call yourself first.
•One-click replay turns any production failure into a permanent test case.

Resources & Links

→ Bluejay: Test, monitor, and improve voice and chat AI agents with digital humans and production observability.
→ Agora Conversational AI Engine: The industry's most powerful and flexible platform for building conversational AI.

Episode Chapters & Transcript

0:00:00

Teaser

Faraz previews digital humans that interact with your product and cut manual voice-agent testing from hours to minutes.

0:00:47

Intro

Hermes welcomes Faraz Siddiqi, co-founder and CTO of Bluejay, and asks how the company evolved from an early meetup idea to today.

0:01:23

From restaurant calls to Bluejay's real idea

Faraz recounts walking Third Ave to sell restaurant voice agents and realizing every agent had to handle 150–200 menu items—a problem that pointed toward something bigger.

0:03:51

The 12-hour testing nightmare

Manually testing menu combinations meant 10–12 hours per agent, which led to version zero of Bluejay (an automated caller) and the pivot into Y Combinator.

0:05:10

Digital humans that test your agent like real customers

Hermes asks how Bluejay builds trust in testing; Faraz explains code-coverage-style goals, digital humans with accents and background noise, and parallel simulations that cut testing to five minutes.

0:07:44

Angry customers, bad connections, and fake traffic

Digital humans have phone numbers, emotional inflection, and can get angry—simulating realistic production scenarios before real customers ever call.

0:08:34

Simulating hundreds of calls to break your system

Parallel load tests surface autoscaling gaps and latency spikes, with Slack and email alerts when average conversation latency crosses thresholds like 4,000 milliseconds.
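The alerting rule in this chapter can be sketched as a simple threshold check. This is a minimal illustration, assuming per-turn latencies in milliseconds have already been collected for each conversation; the function names and data shape are hypothetical and are not Bluejay's API. Only the 4,000 ms figure comes from the episode.

```python
from statistics import mean

# Threshold mentioned in the episode; everything else here is illustrative.
DEFAULT_THRESHOLD_MS = 4000


def average_latency_ms(turn_latencies_ms):
    """Mean response latency across the turns of one conversation."""
    if not turn_latencies_ms:
        raise ValueError("conversation has no recorded turns")
    return mean(turn_latencies_ms)


def conversations_to_alert(conversations, threshold_ms=DEFAULT_THRESHOLD_MS):
    """Return IDs of conversations whose average latency exceeds the threshold.

    `conversations` maps a conversation ID to a list of per-turn latencies (ms).
    Conversations with no recorded turns are skipped.
    """
    return [
        conversation_id
        for conversation_id, latencies in conversations.items()
        if latencies and average_latency_ms(latencies) > threshold_ms
    ]


if __name__ == "__main__":
    load_test = {
        "call-001": [900, 1200, 1100],
        "call-002": [4500, 5200, 4800],
    }
    for conversation_id in conversations_to_alert(load_test):
        # A real setup would post to Slack or send email here.
        print(f"ALERT: {conversation_id} exceeded {DEFAULT_THRESHOLD_MS} ms")
```

In practice the alert would go to a notification channel rather than standard output; the check itself stays the same.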

0:10:43

Why 1.5 seconds of lag kills trust

Bluejay tracks utterance and punctuation latency across every simulation. A 1.5-second pause makes users realize they're talking to AI and shorten what they say.

0:13:11

The one deterministic test for unpredictable AI

Tool call adherence verifies your agent invokes the right backend tools with the right inputs—one of the few deterministic ways to test non-deterministic voice agents.
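A tool call adherence check can be sketched as a plain comparison between the calls a test case expects and the calls the agent actually made. This is a minimal illustration under stated assumptions: the `ToolCall` type, the `book_appointment` tool name, and its `doctor` argument are hypothetical, chosen to mirror the Dr. Pham example from the episode, and do not describe Bluejay's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One backend tool invocation recorded during a conversation (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


def adherence_failures(expected, actual):
    """Compare expected and actual tool calls in order; return a list of mismatches."""
    failures = []
    for step, want in enumerate(expected):
        if step >= len(actual):
            failures.append(f"step {step}: expected {want.name}, but no call was made")
        elif actual[step].name != want.name:
            failures.append(
                f"step {step}: expected {want.name}, got {actual[step].name}"
            )
        elif actual[step].arguments != want.arguments:
            failures.append(
                f"step {step}: {want.name} called with {actual[step].arguments}, "
                f"expected {want.arguments}"
            )
    for extra in actual[len(expected):]:
        failures.append(f"unexpected extra call: {extra.name}")
    return failures


if __name__ == "__main__":
    expected = [ToolCall("book_appointment", {"doctor": "Dr. Pham"})]
    actual = [ToolCall("book_appointment", {"doctor": "Dr. Fan"})]
    for failure in adherence_failures(expected, actual):
        print(failure)
```

Because the comparison is exact, the same conversation always yields the same verdict, which is what makes this kind of check deterministic even when the agent's wording varies from run to run.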

0:16:38

How to know your agent is about to fail

Heartbeat simulations, production observability, and dashboards catch degradation early—including a transcription bug that booked Dr. Fan instead of Dr. Pham.

0:19:52

One prompt, zero code – AI does the integration

Bluejay's rewritten docs include copy-paste AI integration prompts for every API endpoint so Cursor or Claude Code can wire the platform in one shot.

0:22:31

Docs for PMs, devs, and AI agents

Faraz argues docs should serve three audiences, voice will become the default product interface, and LLMs make stream-of-consciousness input workable.

0:28:53

Testing every 10 minutes or before every deploy

Customers run heartbeat tests as often as every 10 minutes, block CI/CD deploys on simulation pass rates, and use replay to turn production failures into regression cases.
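Blocking a deploy on simulation pass rate can be sketched as a small gate that a CI job runs after the simulations finish. This is a minimal illustration, assuming each simulation reports a boolean pass/fail result; the 95% minimum and the function names are assumptions for the example, not values or APIs from the episode.

```python
import sys


def pass_rate(results):
    """Fraction of simulations that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("no simulation results to evaluate")
    return sum(1 for passed in results if passed) / len(results)


def deploy_allowed(results, minimum_pass_rate=0.95):
    """True when the pass rate meets the (assumed) minimum for deploying."""
    return pass_rate(results) >= minimum_pass_rate


def gate_exit_code(results, minimum_pass_rate=0.95):
    """Exit code for a CI step: 0 lets the pipeline continue, 1 blocks the deploy."""
    return 0 if deploy_allowed(results, minimum_pass_rate) else 1


if __name__ == "__main__":
    # Hypothetical results from a pre-deploy simulation run.
    simulation_results = [True] * 9 + [False]
    print(f"pass rate: {pass_rate(simulation_results):.0%}")
    sys.exit(gate_exit_code(simulation_results))
```

Most CI systems treat a non-zero exit code as a failed step, so returning 1 is enough to stop the pipeline before the deploy stage.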

0:31:26

Don't buy an eval platform yet

At 10 calls a week, Faraz advises listening to every conversation yourself first—automate only once you've learned what you need from each flow.

0:34:15

Turn every production fail into a permanent test

Replay isolates the customer utterances that caused a failure and recreates them as a digital human, growing your test suite from 10 cases to 25 overnight.

0:37:43

GTM teams use Bluejay to close deals

Sales teams run hundreds of simulations before handoffs and generate PDF test reports as third-party proof that an agent is Bluejay approved.

0:40:29

The day our own alarms almost broke us

Faraz shares a load-test latency scare on their own infrastructure, teases the Blue Jay World rebrand and a skydiving stunt in Blue Jay suits, and Hermes closes the show.



Tags

#bluejay #faraz siddiqi #hermes frangoudis #voice ai testing #voice ai monitoring #digital humans #eval platform #observability #latency #tool call adherence #ci/cd #replay #heartbeat simulations #conversational ai #voice agents #production voice ai #y combinator