Hermes Frangoudis (00:00)
So this is almost like customer personas taken to that next level in the age of AI. It's not just a bunch of words on paper that you're talking about in a product meeting. This is like a digital human that gets to actually interact with your product and put that to the test.

 Faraz Siddiqi (00:14)
You don't have to spend 10, 12 hours doing it. You could just do it in like five minutes, grab a coffee, come back, the simulation's done. It'll tell you what went wrong and how to fix your agent. Like, the age of developers manually integrating your APIs is coming to an end. The age of developers clicking a button, pasting it into their Claude Code instance, or, you know, Cursor instance, or OpenAI chat, or wherever it is, and just shooting out the integration through an agent is coming, right? Or maybe it's already here.

Hermes Frangoudis (00:47)
Hey everyone, welcome to the Convo AI World Podcast, where we interview the founders and leaders pushing the voice AI space forward. Today, I'm very excited to have Faraz, co-founder and CTO of Blue Jay, here with us. Welcome, Faraz. It's so good to see you again. And it was funny, we were just talking about how we met, and it was very early on in your journey with Blue Jay. I think we met just as you had this idea, and like, take us from

 Faraz Siddiqi (01:01)
Hey, thanks for having me.

Hermes Frangoudis (01:16)
From that point where we met at a meetup where you're like, All right, I have this idea for a startup. How do we get to Blue Jay today?

 Faraz Siddiqi (01:23)
Yeah. Yeah. It was just crazy. We were talking about that. Back when I was in New York, I was working in big tech, and this was just a fledgling idea in the back of my mind. And now we are here. It's actually exciting to bring it up now: tomorrow is our one-year anniversary of Blue Jay. So that'll be April 7th. Super, super excited about that. But back then, back when I was in New York, Blue Jay was just an idea in the back of my mind.

My co-founder Rohan was in Seattle. So we were working across the coasts, trying to make this dream into a reality. We were actually not even building what Blue Jay is today. We were building something totally different in an adjacent space. So for reference, Blue Jay today is a testing and monitoring platform for voice and chat AI agents. We basically ensure that your conversational AI agents are reliable, both in pre-production and in production environments.

But a year ago, we weren't even building the reliability infrastructure for voice AI. We were building in voice AI. We were building voice AI agents for restaurants. And what I would do is I would walk down Third Ave, which is the avenue where I was living in New York, and I would go from restaurant to restaurant. And I'd be like, "Hey man, what are your problems? What are your issues?" Basically trying to do an in-person sell, rain or shine. And they would never tell me what their actual issues were, but they would always be on the phone when they were talking to me, right? They were half doing this, half talking to me.

So that's kind of how we got the idea to start by building voice AI agents for restaurants. So we got some people interested. And when we started to build the demos out for these voice AI agents for these restaurants, it came to the point where every single voice agent we built had to handle like 150 to 200 different menu items for each restaurant. And that is really painful to check. Imagine, you know, a menu item getting ordered, imagine

 Faraz Siddiqi (03:51)
checking your Square POS to make sure that the order was actually placed properly, doing evaluations on top of that, right? Doing all the combinations for it. So I'll give you examples. You're sitting there and you're saying, "Hey, can I order a cheeseburger with fries?" And then you check your POS to see if it got entered. And you say, "Hey, can I order a cheeseburger with fries? And can I also get a drink?" And then you check your system again. So that entire process is probably like three to four minutes a call. It was probably another two minutes on top of that to evaluate the call.

Multiply that by like 200 different menu items, plus reservations, plus all the other capabilities of a restaurant, and it basically meant that manually calling your own voice agent to test it was a really hard problem. It was just something that took way too much time to do. One proper end-to-end test took on the order of 10 to 12 hours if you did it back to back, diligently, without any breaks, and also with one person. So obviously that was not something that was useful. We were like, oh my gosh, this is gonna totally kill our deployment time, it's gonna kill our iteration time, all of this stuff.

So version zero of Blue Jay was literally just us building an automated caller to try to test that voice agent we built. And then, as time went on, we're like, wait a second, hold up. This automated caller thing is a lot more exciting than the voice AI for restaurants thing in our specific use case, because I think personally we spent a lot more time on the test suite and making sure it worked and everything, because we're lazy engineers. We want to make sure that we're calling things as little as possible.

And when we built that automated caller, that's when a couple of our other friends in voice AI started to get excited about it. And they're like, oh my gosh, this is really cool. They were in fact more excited about the testing platform for voice than the actual voice agent that we were building. So we made the pivot. And within weeks, or maybe a month, of us making that pivot, we started to get some traction. We applied to Y Combinator. We got in. So we were in the X25 batch. It's now called P25. I think they changed the name.

And yeah, we went through the batch. We raised our seed round actually towards the end of the batch, before the batch finished. And we kept on doubling our growth over and over again, month after month. And yeah, now we're here today, and it's gonna be a year tomorrow.

Hermes Frangoudis (05:10)
That's so exciting. It's amazing how the journey starts in one direction and then you realize a completely different pain point, right? Where you're like, if I'm feeling this, there's gotta be other founders building these agents that need this. It's super cool. So when you talk about building this end-to-end layer, it's really about trust, right? And how do you build that trust into the testing layer?

 Faraz Siddiqi (05:36)
Yeah. So I mean, there's a lot of things we do. It starts off in our evals, right? We want to do evaluations that actually make sense for a specific industry. And the best way to imbue trust into that evaluation layer is to allow the people who know what they're doing best to write the test cases. So on Blue Jay, when you come onto the platform and you start off by saying, "Hey, I want to test my agent," the first thing we do is, instead of giving you tests, we ask you about your agent. So we say, give us information about your agent, give us your website, give us a system prompt if you have it, stuff like that, so we can get as much context about your agent as possible.

And from there we go and look through your system prompt, and almost in a code-coverage-style fashion, we try to cover all the different things that your agent should be able to do. We call them goals on our platform. Once we create goals, or once you add goals into the platform, then we generate digital humans, which are one of the core concepts of Blue Jay, that will actually go and test your agents.

So when I said that Blue Jay is a testing platform for voice AI, the way that we do it is we create digital humans, which are synthetic replicas of real people, i.e. your customers, that will go and call your agent. And this kind of rings true to our founding story, which was us initially calling our voice AI agent for restaurants over and over again. We're like, hmm, how do we replicate that the best way possible? What if we created a digital version of us that does the calling for us?

Hence the name digital human. So we create these digital humans that go and do many different things. They can talk about whatever they want to talk about, they can place orders, they can schedule appointments, they can do whatever you want them to. But on top of that, they have different accents, languages, background noises, they have different audio qualities, they can interrupt your agent. So they're very human-like, very lifelike. And what we do is we spawn hundreds, if not thousands, of them to go and call your agent in parallel to figure out what's going wrong, without having to do this sequentially.

So what that means is, now if you want to test, in our case, the 200 different things that your agent should be able to do, you don't have to spend 10, 12 hours doing it. You could just do it in like five minutes, grab a coffee, come back, the simulation's done, and it'll tell you what went wrong and how to fix your agent.
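The sequential-versus-parallel point above can be illustrated with a small sketch. This is not Blue Jay's implementation: `run_simulated_call`, `run_suite`, and `timed_suite` are hypothetical names, and the "call" is just a short sleep standing in for a real conversation. It only shows why fanning out many simulated callers at once shrinks wall-clock time.

```python
import asyncio
import time


async def run_simulated_call(goal: str, call_seconds: float) -> dict:
    """Hypothetical stand-in for one digital-human call against the agent."""
    await asyncio.sleep(call_seconds)  # pretend the conversation takes this long
    return {"goal": goal, "passed": True}


async def run_suite(goals: list, call_seconds: float, max_parallel: int) -> list:
    """Run every goal concurrently, capped at max_parallel simultaneous calls."""
    limiter = asyncio.Semaphore(max_parallel)

    async def limited(goal: str) -> dict:
        async with limiter:
            return await run_simulated_call(goal, call_seconds)

    # gather returns results in the same order as the goals were submitted
    return await asyncio.gather(*(limited(goal) for goal in goals))


def timed_suite(goals: list, call_seconds: float, max_parallel: int):
    """Return (results, elapsed wall-clock seconds) for one full suite run."""
    start = time.perf_counter()
    results = asyncio.run(run_suite(goals, call_seconds, max_parallel))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    goals = [f"order menu item {i}" for i in range(200)]
    results, elapsed = timed_suite(goals, call_seconds=0.01, max_parallel=100)
    print(len(results), "simulated calls finished in", round(elapsed, 2), "seconds")
```

With a cap of 100 parallel callers, 200 goals take roughly two call durations of wall-clock time instead of 200, which is the shape of the speedup described in the conversation.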

Hermes Frangoudis (07:44)
Super interesting. So this is almost like customer personas taken to that next level in the age of AI. It's not just a bunch of words on paper that you're talking about in a product meeting. This is like a digital human that gets to actually interact with your product and put that to the test.

 Faraz Siddiqi (07:59)
Yeah, and to that point, I mean, they literally have phone numbers. You can call them, they can call you. It sounds like you're speaking to a real human being, right? They have emotional inflection, they have tone, they can raise their voice. And so there's a lot of cool things going on. Like, if you're calling a customer service pipeline, our digital humans have literally gotten angry that they can't return their TVs, for example. So it's pretty exciting stuff that you can do with digital humans on Blue Jay. But the key point there is that they can literally call your agent via the agent's phone number and simulate really realistic customer scenarios, like whatever you would experience in production, but before you push it to real customers.

Hermes Frangoudis (08:34)
So you'd get that high-volume simulation as well, because when you're testing one to one, it's just you and an AI. That's not really indicative of the production landscape when they go out. They're handling, like you said, hundreds of calls per minute, or per second even. And how does that scale, and how does that break down? That's super

 Faraz Siddiqi (08:52)
Exactly.

Well, actually, it's a really good point. The other part of this is, especially when you start doing hundreds of simulations at the same time, you start to also test your system's load. You can set alerts on Blue Jay to notify you via Slack or email or whatever whenever your agent's average latency in a conversation is over, let's say, 4,000 milliseconds. If this happened maybe ten times in the past five minutes, that means maybe one of your nodes is getting overloaded and your autoscaling is not working, so just spin up more nodes manually. Or, hey, by the way, a big catastrophic failure is about to happen, but it hasn't happened yet. So that's kind of the idea. Blue Jay can help you tell the future in that sense.
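The alert rule described above (average conversation latency over 4,000 ms, about ten times within five minutes) can be sketched as a sliding-window counter. The `LatencyAlert` class and its parameter names are hypothetical and only illustrate the rule; how Blue Jay evaluates or delivers alerts is not specified in the conversation.

```python
from collections import deque


class LatencyAlert:
    """Fires when enough slow conversations land inside a sliding time window."""

    def __init__(self, threshold_ms: float = 4000.0, min_breaches: int = 10,
                 window_seconds: float = 300.0):
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        self.window_seconds = window_seconds
        self._breaches = deque()  # timestamps of conversations over the threshold

    def record(self, timestamp_s: float, avg_latency_ms: float) -> bool:
        """Record one finished conversation; return True if the alert should fire."""
        if avg_latency_ms > self.threshold_ms:
            self._breaches.append(timestamp_s)
        # Forget breaches that have fallen out of the window.
        while self._breaches and timestamp_s - self._breaches[0] > self.window_seconds:
            self._breaches.popleft()
        return len(self._breaches) >= self.min_breaches


if __name__ == "__main__":
    alert = LatencyAlert()
    for i in range(10):
        fired = alert.record(timestamp_s=i * 10.0, avg_latency_ms=4500.0)
    print("alert fired:", fired)
```

A real deployment would hook the `True` result to a notifier (Slack, email, and so on); that part is left out because it depends on the surrounding system.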

Hermes Frangoudis (09:31)
Super cool. And I wanna go back to one of the things that you mentioned. These synthetic users, they have real personalities, you said? Do you programmatically kind of spin them up and figure out, okay, this one's gonna be in the city, this one will be at home? How does that work?

 Faraz Siddiqi (09:42)
Yeah. So we have the ability to configure digital human behavior, right? You can literally make them stuck in traffic, and the traffic is really loud, so you can raise the volume of that background noise of traffic. You can go ahead and, since it's raining outside, maybe their audio quality is not that good, so you can have medium audio quality. So we simulate network impairments like jitter and packet drops and all these things that make it sound like you're speaking over a 3G network on a 2010 Android, just to simulate really bad conditions. Because sometimes that impacts your ASR transcription, which will impact your LLM's decision making. That's another important reason why voice-to-voice, by the way, is a really good testing modality if you're building voice agents, of course. But yeah, we do a lot of things that allow you to configure a digital human's behavior to make it more and more realistic, or tuned to your customers' definition of what real is. Super cool.
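To make the idea of a configurable caller concrete, here is a minimal sketch of what such a configuration and a simple network-impairment step might look like. Everything here is an assumption for illustration: `DigitalHumanConfig`, its fields, and `degrade_frames` are hypothetical names, not Blue Jay's schema, and real jitter and packet-loss simulation operates on an audio stream rather than a list of labels.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalHumanConfig:
    """Hypothetical knobs for one simulated caller."""
    persona: str
    background_noise: str = "none"   # e.g. "traffic"
    noise_volume: float = 0.0        # 0.0 (silent) to 1.0 (very loud)
    packet_drop_rate: float = 0.0    # fraction of audio frames lost in transit
    max_jitter_ms: float = 0.0       # upper bound on random per-frame delay


def degrade_frames(frames: list, config: DigitalHumanConfig, seed: int = 0) -> list:
    """Return (frame, arrival_delay_ms) pairs after simulated drops and jitter."""
    rng = random.Random(seed)  # seeded so a failing test run can be replayed
    delivered = []
    for frame in frames:
        if rng.random() < config.packet_drop_rate:
            continue  # this frame never arrives
        delivered.append((frame, rng.uniform(0.0, config.max_jitter_ms)))
    return delivered


if __name__ == "__main__":
    stuck_in_traffic = DigitalHumanConfig(
        persona="commuter placing an order",
        background_noise="traffic",
        noise_volume=0.8,
        packet_drop_rate=0.1,
        max_jitter_ms=120.0,
    )
    frames = [f"frame-{i}" for i in range(50)]
    print(len(degrade_frames(frames, stuck_in_traffic)), "of", len(frames), "frames delivered")
```

Seeding the random generator is a deliberate choice: a degraded call that exposes an agent bug can then be reproduced exactly.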

Hermes Frangoudis (10:43)
So you mentioned latency, not only in terms of, hey, your nodes are not scaling up and these things have like 4,000 milliseconds of latency, but also what happens with a poor connection and how that impacts the whole round trip. So when you do those kinds of tests, is that just a latency portion, or is that built into the overall scheme where it's randomly distributed in?

 Faraz Siddiqi (11:10)
Yeah, yeah. So we test latency as a default metric that exists on Blue Jay. So no matter what, in every simulation that you run, we're gonna test your latency. We're not gonna just test the average latency, we're gonna test your utterance latency, and actually sub-utterance latency, for example; it's called punctuation latency. So if an agent has maybe five sentences in an utterance that it says back to back, we'll even test to see how long it took between one sentence finishing and the next sentence starting inside of an utterance, which is something that's pretty cool. That helps you determine things related to how natural your agent's speaking is, and so on. And we also of course have P50, P90, P95, and P99 latencies for you as well.

I think we released an article about this as well. But the idea is that in voice AI, people are very unforgiving about latency. It's one thing to have a realistic-sounding voice, and I think TTS providers are killing it nowadays with more and more realistic models. But the latency is still a really big problem, in the sense that if a customer picks up a phone call and they hear a realistic voice, and then they hear a 1.5-second lag before the next response, that is a really big pain point for them, because they'll immediately realize this is AI. And because of that, their conversation length will go from conversational, as if they're talking to a human, to one or two words, as if they're talking to AI, which is just a standard. People are not gonna go and yap their heart out when they're speaking to AI agents, right?

But because of that, they actually might on their own restrict their intent of what they want to do. So if a customer comes in and says, "Hey, actually, I want to schedule an appointment with Dr. Pham two weeks from now. My knee's kind of hurting, so that's why I want to do this." If they're talking to an AI agent, maybe they're just like, "operator," or "appointment, appointment," or something like that. Because they realize they're talking to AI, they actually might make their statements more concise, to the point where they're actually losing information. So that's another unintended side effect of latency being a problem, more like a downstream effect. But yeah, basically what Blue Jay does is we monitor that latency for you in every conversation so you can use it as a KPI to optimize.
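The latency metrics mentioned above (percentiles, plus the gap between sentences inside one utterance) can be sketched in a few lines. This is an illustrative sketch only: it uses the nearest-rank percentile definition, which is one common convention among several, and the function names and the `(start_ms, end_ms)` sentence format are assumptions rather than anything Blue Jay documents.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list) -> dict:
    """P50/P90/P95/P99 over per-utterance response latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 95, 99)}


def punctuation_latencies(sentences: list) -> list:
    """Gaps (ms) between one sentence ending and the next starting in one utterance.

    `sentences` is a list of (start_ms, end_ms) pairs in spoken order.
    """
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(sentences, sentences[1:])
    ]


if __name__ == "__main__":
    print(latency_summary([320, 410, 380, 1500, 450, 390, 2200, 400]))
    print(punctuation_latencies([(0, 1000), (1200, 2000), (2900, 3500)]))
```

Tail percentiles matter here for the reason given in the conversation: a single 1.5-second pause is enough for a caller to notice, and an average hides it.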

Hermes Frangoudis (13:11)
Super interesting. You never think about that piece of it where when you're deploying an agent, it's not just about the voice sound, but that piece of like, does the user ever realize it's an agent? And how does that affect their like overall behavior? Yeah. Absolutely.

So aside from latency, what are other like important metrics that a team should monitor once they're building these agents and have them live in production?

 Faraz Siddiqi (13:39)
Yeah. So it depends per team on what's important. I'll give you some general examples, but then I'll also quickly talk about how every team's metrics are different. So the two general ones I think are really important are, of course, latency, and there's another really big one called tool call adherence. One mental model for thinking about voice agents, or LLMs in general, is as an interface to interact with your product. You can think of a voice agent almost as the interface to interact with your business if it's manning the front desk of your phone line.

So if I'm calling, for example, a clinic, and there's a voice agent at that phone number when you call it, that means that voice agent is the mouthpiece of someone's business, and it gives the customer the ability to interact with the business. That's the mental model I want us to be thinking about. So for example, if someone calls in to schedule an appointment, the voice agent is their interface to schedule the appointment with the business.

One important part about this is that if that is your interface, you need to make sure that it is calling what it's supposed to call under the hood and representing your business properly to that customer. What I mean by that, very specifically, is if your customer says, "Hey, I want to book an appointment for next week on Tuesday," your voice agent should be able to parse that and call the book appointment tool with the right inputs and the expected output parameters. Because it is just the interface.

So you should be able to test that interface to make sure it does what it's supposed to do. On Blue Jay, tool call adherence is a first-class feature for us. You can create digital humans and you can list expected tool calls that your agent should make as a result of what the digital human is calling about. The agent will call the tools, we will log that, and we'll make sure that the tools that were called in the conversation were the expected ones, that the actual matched the expected in this conversation.

So tool call adherence is really, really important. And there are not many deterministic ways to test non-deterministic agents, but this is a good one. It's a very good deterministic way to test if your agent is doing what it's supposed to do.

But every single customer is very different. For example, what a public bank's voice agent would want is totally different from what a healthcare agent would want, which is totally different from what a restaurant agent would want. If someone's really curious to hear about whether they're always upselling a burger-and-fries combo into a burger, fries, and a drink combo, you could do that for a restaurant agent, but it wouldn't make sense to do it for a healthcare agent, which is why on Blue Jay we have the ability to add custom metrics, depending on what your use case is. Which kind of goes back to Blue Jay's main purpose, which is for us to be a general-purpose evaluator. We want to be able to give people the tools to write metrics that make sense for their business. We don't want to be prescriptive about the metrics that we think are important, other than

Hermes Frangoudis (16:22)
We want to give people

 Faraz Siddiqi (16:31)
the main ones, which are latency, tool call adherence, and instruction following, these really massive features that are shared across the industry. So
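The expected-versus-actual comparison behind tool call adherence can be sketched as a plain deterministic check. This is a minimal illustration under stated assumptions: `ToolCall`, `make_call`, and `tool_call_adherence` are hypothetical names, the `book_appointment` tool and its arguments are invented for the example, and the check ignores call order and duplicate calls, which a fuller version might care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation: the tool name plus its arguments as sorted pairs."""
    name: str
    arguments: tuple


def make_call(name: str, **arguments) -> ToolCall:
    """Build a ToolCall; sorting makes argument order irrelevant when comparing."""
    return ToolCall(name, tuple(sorted(arguments.items())))


def tool_call_adherence(expected: list, actual: list) -> dict:
    """Compare the calls the agent made against the calls the test expected."""
    missing = [call for call in expected if call not in actual]
    unexpected = [call for call in actual if call not in expected]
    return {
        "passed": not missing and not unexpected,
        "missing": missing,        # expected but never called
        "unexpected": unexpected,  # called but not expected
    }


if __name__ == "__main__":
    expected = [make_call("book_appointment", day="Tuesday", week="next")]
    actual = [make_call("book_appointment", day="Wednesday", week="next")]
    report = tool_call_adherence(expected, actual)
    print("passed:", report["passed"])
    print("missing:", report["missing"])
```

Because the check compares structured calls rather than free-form speech, the same conversation can be worded differently on every run and still pass or fail for the same reason.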

Hermes Frangoudis (16:38)
No, that makes sense. And I like how you say it: one of the few deterministic ways to test a non-deterministic model, because that's always one of the big gripes. It changes every time. A little bit is different here and there, but that's the same as with a human, right? A human with a script in front of them, reading from that script, can always vary it from call to call. And so it makes sense that you would have, okay, these are the intended tool calls we're trying to get out of you, then here are the tool calls you actually made. Do they line up? Where do they fail? What happened? So what are some of the early warning signs to tell when a system starts to degrade, before people freak out and complain? Because they're not very forgiving.

 Faraz Siddiqi (17:21)
Yeah, yeah. So once again, for every business it's really, really specific. The general ones are still, hey, by the way, if your average latency is over a certain amount, maybe one of your nodes is not autoscaling properly. That's probably a really big one that could help you figure out, okay, nodes are literally not scaling properly, so my system's gonna go down if we don't autoscale properly. And it's really cool that Blue Jay can kinda tell you about that. There are also other good practices that you could have, for example, heartbeat simulations, which is something we support. That's basically running a scheduled simulation on your agent once every hour, once every thirty minutes, something like that, just to guarantee uptime. So it's almost like uptime monitoring for your voice agent in production.

But the other thing that really helps with this is observability. And this is, by the way, the second big thing that Blue Jay offers. The first part of this was simulations, the second part is observability. So what Blue Jay does is we listen to every single call between your agent and real customers, and we figure out exactly what went wrong. We send you alerts when issues happen. So we're almost like an agentic sentry for voice AI, if that makes sense. We basically listen to every single call and make sure that things are going as they're supposed to.

The other cool thing there is you can visualize these issues on dashboards. So for example, if you're curious to see whether your agent upsold the customer, for that restaurant example, from a burger to a burger, fries, and a drink over the past thousand calls, you can actually visualize that as a bar chart of the percentage of calls per day that were upsold, and you can use that to make key business decisions. So it's really exciting there.

Back to your question, though, on the kinds of metrics that people monitor to make sure that their agent is actually performing: latency is important, tool calling adherence is important. And then there's of course business-specific metrics. So for example, if they notice that, hey, by the way, in the healthcare scheduling setup, one doctor has gotten zero appointments over the past week from this voice agent, that's a little odd. Maybe we should go and figure out what happened here, why the voice agent is unable to schedule Dr. Pham, for example, for a meeting. Then you go into the transcriptions with Blue Jay and you realize, oh my gosh, my voice agent has been transcribing Dr. Pham, like P-H-A-M, as Dr. Fan, like F-A-N, which is why Dr. Pham isn't getting any meetings, and everything starts to make a lot more sense. So these are very, very niche, customer-specific issues. If you didn't have observability or a simulation suite to figure that kind of stuff out, they would just go unnoticed. It would just feel like Dr. Pham is, you know, someone that no one likes.
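The per-day upsell chart described above boils down to a small aggregation over labeled calls. The sketch below assumes each call has already been labeled with a date and an `upsold` flag (how that label is produced is outside this example); `daily_upsell_rate` and the record shape are hypothetical, not a Blue Jay API.

```python
from collections import defaultdict


def daily_upsell_rate(calls: list) -> dict:
    """Percentage of calls per day that ended in an upsell.

    `calls` is a list of dicts with a 'date' string (YYYY-MM-DD) and an 'upsold' bool.
    """
    totals = defaultdict(int)
    upsold = defaultdict(int)
    for call in calls:
        totals[call["date"]] += 1
        if call["upsold"]:
            upsold[call["date"]] += 1
    # Sorted by date so the result can feed a bar chart directly.
    return {day: 100.0 * upsold[day] / totals[day] for day in sorted(totals)}


if __name__ == "__main__":
    calls = [
        {"date": "2025-04-01", "upsold": True},
        {"date": "2025-04-01", "upsold": False},
        {"date": "2025-04-02", "upsold": True},
    ]
    for day, rate in daily_upsell_rate(calls).items():
        print(day, f"{rate:.0f}% of calls upsold")
```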

Hermes Frangoudis (19:44)
The AI has just blacklisted that for no reason.

 Faraz Siddiqi (19:46)
Actually, it was just a transcription problem. So it's exciting to help you figure out stuff like that.
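The "one doctor has zero appointments" signal from the example above can be sketched as a simple business-specific check: compare the names the agent actually booked against the clinic's roster. The function names and the sample data are hypothetical; the point is only that a roster member with no bookings, alongside a booked name that is not on the roster, is a strong hint of a transcription mismatch.

```python
from collections import Counter


def providers_without_bookings(roster: list, booked_names: list) -> list:
    """Roster members who received no bookings in the period under review."""
    counts = Counter(booked_names)
    return [name for name in roster if counts[name] == 0]


def unknown_booked_names(roster: list, booked_names: list) -> list:
    """Names the agent booked that do not appear on the roster at all."""
    known = set(roster)
    return sorted({name for name in booked_names if name not in known})


if __name__ == "__main__":
    roster = ["Dr. Pham", "Dr. Lee"]
    booked_this_week = ["Dr. Fan", "Dr. Lee", "Dr. Fan"]
    print("no bookings:", providers_without_bookings(roster, booked_this_week))
    print("not on roster:", unknown_booked_names(roster, booked_this_week))
```

Neither list proves a transcription error on its own, but seeing "Dr. Pham" in the first and "Dr. Fan" in the second tells you which transcripts to open.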

Hermes Frangoudis (19:52)
Super cool. And this is through like a Blue Jay SDK? You'd build this right into your agent, and it would send the metrics back to Blue Jay as the calls are going on, so you get that both in production but also in your test observability, right?

 Faraz Siddiqi (20:07)
We have a fully built API suite. So people just call our APIs, send us the calls after they're done, and they're all good to go. The other cool thing is, and actually I posted about it this morning, we recently rewrote our docs to be a lot more agentic, in the sense that now every single endpoint that we have has a copy-pasteable AI integration prompt. So it's literally a title and then "AI integration prompt," and it's just a bunch of text that you can copy. The reason being that you can just go put that into Cursor, or put that into Claude Code, or any of your AI tools, Copilot, whatever it is, and one-shot integrate Blue Jay's APIs into your platform without even touching a line of code, which is super exciting.

And I think also, by the way, I hope this is not a hot take, but something that I think people should be doing a lot more is just having AI integration prompts in their docs. Because in the end, the age of developers manually integrating your APIs is coming to an end. The age of developers clicking a button, pasting it into their Claude Code instance, or, you know, Cursor instance, or OpenAI chat, or wherever it is, and just shooting out the integration through an agent is coming, right? Or maybe it's already here. Definitely.

Hermes Frangoudis (21:10)
I definitely agree with you that it's here, but not in the sense that the agent takes over all the way. I think a lot of developers are, like you said, still reading the documentation, figuring out what they want their agents to do, and then giving them that one-shot prompt. We recently rolled out our own Agora skills and MCPs and started rolling out prompts within our docs as well. We see it. Everyone's moving in that direction.

 Faraz Siddiqi (21:32)
That's so cool. Have you guys been seeing the adoption of prompts and everything inside of your developers' coding? What was your main push for putting prompts into your docs?

Hermes Frangoudis (21:42)
A lot of it came just from our internal teams, like how we are building. And the more we talk to other developers, it's like, okay, a lot of us are still maybe on that forefront where we're giving a lot more up to the agent than we might have done in the past. But we're also seeing that that's broken down a lot of those barriers. Like, I want to build out this feature: talk to my agent, plan it out, review the plan, have it go. And we're no longer thinking about, like you said, the nitty-gritty of code. It's more so the business logic: how does this fit into the overall flow of the developer experience? And now we're thinking about the agentic developer experience as well.

 Faraz Siddiqi (22:16)
possibilities.

Yeah. No, I think going forward, docs should be written with three audiences in mind. They should be written for PMs, for developers, and for agents. I think that's the North Star for docs.

Hermes Frangoudis (22:31)
It's just moving so fast in that direction. You know, we're having debates internally, like, when is the user actually just going to give away that level of control? And they won't even be reading your docs, and everything's gonna be fully agentic. The developer experience from that perspective, I think, is still kind of the bleeding edge, but we're moving closer and closer to it, with good stability, it looks like.

 Faraz Siddiqi (22:55)
Yeah. I mean, an even crazier take, I think: I do these weekly newsletters for people who are interested, where I just write my unfiltered thoughts as someone who's building in conversational AI every week. And the thing that I'm writing about today is how, in the very near future, I think UIs will soon become obsolete. Zooming out to a higher level, I think agents as an interface to the product you're building will become the de facto way of interacting with products.

In the sense that if you want to deal with businesses, if you want to go and schedule appointments with a business, you're no longer going to go to their website and click the schedule button. You're going to call an agent that does it for you. If you want to use some complex product, like some SaaS offering, you're no longer gonna have to go and figure out how all the buttons work and figure out all these tools. You'll have your own agent basically interface over the MCP that the company exposes, and you can just figure out what you want via natural language. And only the super users, people who are really heavily using the SaaS offering or the product or whatever, will care enough to not use their chat agent and actually use the buttons.

It's almost like the difference between typing on your keyboard versus using commands to do stuff on your keyboard. I feel like that's what's gonna happen with LLMs versus actually using a product. The first approach will just be, "Hey, listen, can you do this for me?" And if it doesn't work, then they'll go in and figure it out. You know?

Hermes Frangoudis (24:29)
Not so hot of a take. I think I could totally see that future where, like you said, that's the de facto interface. And even as a provider, you're gonna think about generative UI, right? That's kind of a new big thing. And how do you bring that into these different agents that then create UI on the fly? Do you then, as a provider, give the generative UI to the agent to be able to render within its own environment? So now, like, I need to make a voice call. Cool, I'll bring in Agora, I'll scaffold this UI right within it, and then my agent will just start connecting, right?

 Faraz Siddiqi (25:07)
Can you imagine, man? Yeah. Like JIT, there's this cool, I think it's called just-in-time compilation, right? Where you're basically building code on the fly from a user request in a way that makes sense for him. The way that I've always visualized it is, I don't know, if you watch these Iron Man movies, right? He's always messing with his internal lab with all these holographic interfaces and he's grabbing stuff. And growing up, you're like, okay, this is CGI, whatever, right? Like

That's just stuff to make it look cool for the audience. But I mean, maybe that's the just-in-time compilation of what made most sense for Tony Stark to interface with his suits. And that's just what it was. That's the thing that made most sense for his brain. And maybe that's the way that we're headed. Really exciting stuff.

Hermes Frangoudis (25:52)
Super exciting. And I have a background in AR/VR. So for me, when you say the interface is disappearing, I think at some point we're gonna have screens in general disappear. It'll be some sort of projection-based UIs that adapt to what you're doing, and you're using your voice instead of typing on a screen somewhere, right?

 Faraz Siddiqi (26:15)
Yeah, no, I totally agree. I do think that, literally in terms of efficiency, voice is the modality to capture that. Right. Typing is too slow. I mean, Wispr Flow has shown us that, right? And Willow and a couple other cool competitors in that space. The voice interface as a way of interacting with your environment is a lot more seamless. It feels like a stream of your real thought. It's definitely a lot less clunky than typing out your thoughts. And I think it will be the de facto interface

for literally your entire environment.

Hermes Frangoudis (26:46)
I agree with you. I see it with my kids now. They're, you know, that next generation. They're three and six. And they talk to the computer. They don't know how to type. Right. And the interesting thing about when you start talking to the AI machine, right? Like the computer, the screen, whatever you want to call it, the input gets a lot messier. It's not like a coherent, like I sat and typed my thought, I reread it. Does it make sense? Is it a compl... no, it's just like stream of consciousness right in.

 Faraz Siddiqi (27:16)
So that's one thing where I totally agree with you. When I do my newsletters, it's zero. I don't use Wispr Flow, there's no AI, there's none of it. It is literally just me sitting down, everything turned off except for the sheet in front of me. Just typing it out. Because in the end I still think that when you're writing your stuff out, you can be a lot more meaningful with what you say. It's almost like when you're speaking out loud, what you say is your first draft. But when you're writing you can iterate and you can think and you can

you know, get to a cohesive narrative that makes sense. The good thing with LLMs though is that when you're speaking into it, right, when I'm commanding my Claude Code agents to go and, you know, work or do something, I don't have to be perfect on the first try. I can actually give my stream of thought, and they can parse it perfectly fine. So actually I think voice and LLMs are coming onto the main stage at the same time and they're working beautifully together.

Hermes Frangoudis (28:06)
Yeah, it is just like that perfect catch-all. It can catch every stream of thought, whether it's super coherent or kind of goes around in a circle before it comes to its point. These things will just sanitize it and clean it up. And it's amazing because that now unlocks a whole area for businesses to interface with their customers, right? Where traditionally you would need a physical person, you'd have to rely on sustaining that cost. Like how often do you need to call them? How many times do you need to have the agent call out?

So yeah, it's amazing where these things are converging. So when you think about all this testing, it's not like a one-off thing. We're not testing every so often. This is like CI/CD type tests. You're running these on a regular basis. So what kind of cadence are you seeing with your customers in these sort of observability and testing?

 Faraz Siddiqi (28:53)
Yeah.

So for simulations, we have customers who run heartbeat tests every 10 minutes, 24 hours a day. Right. ⁓ and these are customers who have like really like strict, you know, SLAs with their customers on uptime, right? Where they really need to make sure that their agent can handle things like at any given time. Right. So yeah, so there are there are customers that are deployed that we test for and monitor for in like really critical environments. So like, you know, in emergency centers or in, you know, like healthcare centers and stuff like that, where like they need to be available.

Because them not being available means that someone's gonna get impacted. So they basically have really, really strong SLAs, right? Then there's also people who do, you know, CI/CD-based integrations like what you were saying, right? Where they have this suite of, you know, a hundred or two hundred digital humans that they've pre-created on the platform that they really like. And of course Blue Jay's helped them generate those, right? We generate them from the context of their system prompt and give them the code coverage feel and everything, right? So with a combination of taking Blue Jay's generated digital humans and also adding their own interesting ones and also, right,

taking digital humans that were generated from production failure cases, which is something that Blue Jay does. It's a feature called replay. So we literally take a production call that has failed and we isolate the customer utterances that caused the failure and we replay it back as a digital human to your agent. So you can reproduce a production issue you didn't know about and make sure it never happens again by adding it to your test suite. Right. So it's super, super cool stuff, right? It's almost like a cool feedback cycle. But yeah, we do that kind of stuff. They take these test suites and they run them whenever they push to prod.

Right. And it's a blocking step in their CI/CD flow. They're like, hey, by the way, we cannot push to prod unless, you know, a certain exit flag is hit, so that a production agent doesn't go down. So it's a really exciting way to, first of all, make sure that whenever you surface an issue in production, right, that issue does not get surfaced again by your customer, because you don't want your customer complaining twice, right? Complain three times, they're gone. Right. And the second cool thing about that is that you also have confidence in, you know, regression testing.

Right. So if you add a new feature or you change your system prompt, you make whatever targeted change you want to make, how can you be sure without a test suite, right, that the system prompt change you made at, you know, the first couple lines of your prompt didn't impact the functionality that is normally governed by later lines in your prompt? You really don't know, right? Unless you have that test suite. So that's another cool way that people use Blue Jay: to almost add determinism into their voice AI development, right? To literally have a test suite that makes sure it does what it's supposed to do

before it goes to production.
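
A minimal sketch of the blocking CI/CD step described above, in Python. Everything here is an assumption for illustration: `SimulationResult`, `pass_rate`, `gate_exit_code`, and the 95% threshold are hypothetical names and values, not Blue Jay's API. It only shows the general idea of turning a suite's pass rate into an exit code that a pipeline can block on.

```python
from dataclasses import dataclass


@dataclass
class SimulationResult:
    """Outcome of one simulated conversation (hypothetical shape)."""
    persona: str
    passed: bool


def pass_rate(results):
    """Fraction of simulations that passed; an empty suite counts as 0.0."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gate_exit_code(results, required_rate=1.0):
    """Return 0 to allow the deploy, 1 to block it."""
    return 0 if pass_rate(results) >= required_rate else 1


# Placeholder results; a real pipeline would collect these from the test run.
results = [
    SimulationResult("impatient caller", True),
    SimulationResult("caller who changes the order midway", False),
]
code = gate_exit_code(results, required_rate=0.95)
print(f"pass rate: {pass_rate(results):.0%}, exit code: {code}")
# In a CI job, the script would end with sys.exit(code) so that a
# non-zero code fails the step and stops the push to prod.
```

An empty suite deliberately blocks the deploy here, on the assumption that "no tests ran" should never count as a pass.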

Hermes Frangoudis (31:20)
A suite that lets you sleep at night as a developer, right?

 Faraz Siddiqi (31:24)
A suite that lets you sleep at night. Exactly.

Hermes Frangoudis (31:26)
So what advice would you give to a team moving from prototype to production for the first time?

 Faraz Siddiqi (31:32)
So if it's the absolute first time and you have your first customers and you're doing like ten calls a week, or you're doing some really small volume, right? My honest advice as someone who is building an eval platform is to not use an eval platform. Right. And it's seriously that, because I think at the earliest stages, the absolute number one thing that you could do to make your product intuition better is listen to every single call, right? And really just understand your customer, understand your customer's customer, basically figure out

you know, what you need to do, what prompt changes you need to make. I think it'll make you better as a founder, right? If you're able to understand your customer at that deep level. The time where I'd start thinking about automating this process, right, is when you're doing the process over and over again. So maybe you onboard a couple more customers and now you're doing like a hundred calls a week, right? And you're very, very sure that you've learned everything that you need to learn from, like, the flow where your agent orders a cheeseburger.

Or like receives an order for a cheeseburger. You don't need to test that anymore. You just need to make sure it's not broken, right? You basically don't need to learn anything more from that flow, right? Therefore it's ready for outsourcing, right? And that is when you use a platform like Blue Jay, right, to basically run your simulation suites and test suites and make sure there's determinism in that testing flow. So that's the first part. Add simulations when you have hit that point where you are not having to learn from every single call. You want to basically replicate to make sure those things don't break.

Right. So that's really important. The second part is when you want to iterate fast, you have deadlines coming up, you have a demo that's coming up really fast, right? That's another really cool place to use Blue Jay, because there are GTM teams that use Blue Jay to spin up demos faster, which is really cool. Right. They're just like, hey, shoot, I have this demo coming tomorrow or in two days, and I need to make sure the agent is really performing, because I have to pass this off to a customer for an entire week and they can do whatever they want with it. So I'm gonna go and run like five hundred simulations with Blue Jay, right, to test all these different things and then have a really solid demo agent.

For, you know, like someone's customer, ⁓ so they can have a higher chance of conversion, right? So that's another use case. And then the final use case and like the most massive use case, of course, is observability. ⁓ this is like, hey, by the way, now that I'm done testing my agent, I need to go put it into production and let it kind of like handle real customers. And, you know, like regardless of how much you thought about it, regardless of how much you tried covering your system prompt, there will be cases like edge cases in production that you reach or that you that your agent faces where it will fail. So the best thing you do in that scenario is capture those failures and make sure they never happen again. Because if

You keep on doing it again and again and again in production, you're not catching these issues, right? No one's telling you it's a problem, then your agent's not going to improve. Right. So that's why I think like that's like the ultimate use case of Blue Jay. If you want to build an agent that actually gets better over time, you should be using a testing platform like Blue Jay, ⁓ to help you get there.

Hermes Frangoudis (34:15)
Totally makes sense. So in the early days, don't skimp on the review. Be the one in there reviewing it. And then once you're sure, okay, this is this is the eval I need, that's when you can call Blue Jay and be like, actually these are the things we're testing against. These are the areas we don't want to ever break because we've figured out their paths. Now be the catch all and tell us where it's breaking and what's causing that.

 Faraz Siddiqi (34:20)
Exactly.

Exactly. When you're at the stage where you wanna start scaling and you wanna start improving your agent in a programmatic pipeline, that's when you use Blue Jay.

Hermes Frangoudis (34:48)
So do you think, like, we have the synthetic users in Blue Jay and they enable your customers to run all sorts of tests. Do we now have the ability to then replay those through another AI system that can point out improvements to you? How do you look at some of that?

 Faraz Siddiqi (35:05)
Yeah, yeah. So this concept of replays is really interesting. So, you know, Blue Jay handles millions of calls a month. We work with massive customers like Google and DoorDash and, you know, public banks and really, really massive enterprises, but also fast-growing startups like Eleven X and Aurelian and some really interesting use cases of voice AI in the startup world as well. Right. And for our customers, we have this feature called replay,

which takes a production call, isolates the customer utterances that caused that failure, and creates a digital human that replicates that conversation in a simulated environment. So very simply put, we take a production failure, we bring it into our simulation environment, and we allow you to reproduce the problem with your agent, right? So that you can fix the agent with multiple iterations to make sure it never happens again. Right. And the idea there is that, you know, if you had a test suite with 10 test cases and you put your agent into production and

find like fifteen failures, now your test suite better have twenty-five cases, because you wanna make sure you never run into those issues ever again. Right. And we also of course provide a lot of tools to help you be diligent there. We identify the failures, we send you alerts, right? We have a one-click button that allows you to replay the issue. So there's a lot of platform ergonomics that really help with this. But yeah, we basically give you the tools to help you be really, really strict about an improvement process for your agent. No, that makes sense.
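
The replay idea and the "10 cases plus 15 failures gives 25" arithmetic above can be sketched roughly like this in Python. All names (`Turn`, `RegressionCase`, `replay_case`, `extend_suite`) are hypothetical and not Blue Jay's API. As a simplifying assumption, the sketch keeps every customer utterance from the failed call, whereas the feature described above isolates only the utterances that caused the failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # "customer" or "agent"
    text: str


@dataclass(frozen=True)
class RegressionCase:
    name: str
    customer_utterances: tuple


def replay_case(call_id, transcript):
    """Keep only what the customer said, so it can be replayed against a new agent build."""
    utterances = tuple(t.text for t in transcript if t.speaker == "customer")
    return RegressionCase(name=f"replay-{call_id}", customer_utterances=utterances)


def extend_suite(suite, new_cases):
    """Append new cases, skipping any whose customer utterances are already covered."""
    seen = {c.customer_utterances for c in suite}
    out = list(suite)
    for case in new_cases:
        if case.customer_utterances not in seen:
            seen.add(case.customer_utterances)
            out.append(case)
    return out


# Ten hand-written baseline cases (placeholder data).
suite = [RegressionCase(f"baseline-{i}", (f"scripted request {i}",)) for i in range(10)]

# Fifteen failed production calls (placeholder data).
failed_calls = {
    f"call-{i}": [
        Turn("customer", f"unexpected request {i}"),
        Turn("agent", "Sorry, I can't help with that."),
    ]
    for i in range(15)
}

new_cases = [replay_case(call_id, turns) for call_id, turns in failed_calls.items()]
suite = extend_suite(suite, new_cases)
print(len(suite))  # 10 baseline + 15 replayed = 25
```

Deduplicating on the customer utterances keeps the suite from growing when the same failure shows up in several production calls.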

Hermes Frangoudis (36:29)
As a developer and a business, you want to be able to really understand that and then expand your test cases. Like you said, there better be twenty-five or maybe even thirty cases now to better accommodate and make sure you don't break that.

 Faraz Siddiqi (36:42)
There's a couple of cases where people expand their test suite. So the first one, of course, is production failures, right? That causes them to be like, okay, there are some issues here, I need to go and add them to my test suite and make sure they never happen again, right? That's the first one. And the second one is like, hey, I'm adding new features, right? And I need to go and make sure I add test cases that cover those features as well. Right. But if we zoom out for a second, what we're doing here is, you know, just how SaaS has a test suite that makes your product

you know, rigid in a sense that, hey, it's not gonna break in these ways, right? With this test suite in voice AI, you're almost adding determinism to the entire structure. Right. You go from what is traditionally this really malleable, non-deterministic flow that's governed by a system prompt and these handoffs, where you don't have much observability into the stack and you don't really know what's going on, right? To a system where it needs to follow these constraints. And if it passes these constraints, it's gonna do what your customer wants. So it's a very, very structured way

of thinking about, you know, developing something that is inherently unstructured.

Hermes Frangoudis (37:43)
Makes total sense. And it's just gotta become part of the process at the end of the day. I did want to circle back to a comment you made earlier about go to market teams testing out. So it's not just developers, right? Like you're talking about teams that are going in front of customers, making sure that their demos are working, which we all know how hard that is. Live demo is like the gauntlet.

 Faraz Siddiqi (38:04)
No, it is. There's an interesting use case there. So there's a couple customers who use us today, right? And out of their team, their GTM teams are actually constituting a majority of their usage for simulations. And the reason why is because, you know, when they're selling their agent to a new customer, right, or a modified version of their agent, a customized version of their agent, because their FDE motion is really big, right? To a new customer, they spend a lot of time on Blue Jay, basically running sims to make sure that

You know, that demo agent or like the agent that they hand eventually hand off to the customer, right? It doesn't even have to be the demo, right? Is really, really solid for that customer's use case. So they go and ask Blue Jay, they're like, hey, by the way, this is what I'm planning to build. Like here, like what are some goals for this agent? Blue Jay like goes like figures out goals, figures out the code coverage setup, right? We go and create these digital humans, they go and run the digital humans to almost anticipate what you know, the like the ways that the customer would go and test this demo, right? We find failures, they go and say, Okay, these are some problems, I need to go fix this.

They make the system prompt changes, they run it again, and they keep on like kind of iterating until they get to a certain percentage pass rate with Blue Jay. Right. ⁓ and then when they've hit that, they're like, okay, now I understand what's going on. Like I understand this agent's gonna do at least the following 200 things really well, as opposed to the following three things really well if they were to do it manually. ⁓ so it's a huge, huge like increase in accuracy and just like demo quality for the agent that they're they're creating if it's tested with Blue Jay. And the second part of that is that with Blue Jay, you can generate reports, right?

Actual testing reports, like PDFs. They're really nice, they're beautiful, right? And you can actually use them in the GTM process. So the first part of it is testing your agent with Blue Jay. But then after you test it with Blue Jay, Blue Jay becomes a sales tool, right? Because you can go to your customer and be like, hey, by the way, this agent that I built is not just something that I've come up with after sitting down with a computer for a couple hours. It's like, no, dude, I have a third-party testing platform that has verified its functionality for these cases. So it's like Blue Jay approved, right?

And here's a PDF that shows it, right? Also, it's a very nice selling point to a lot of the clients. They're like, okay, you know what? This company that I'm interfacing with has put in effort. They're not just building this out for me and throwing it out to me, right? They didn't just tell Claude Code to build it and send me whatever, you know, stuff came out of the oven, right? They spent some time, they took a testing approach to this, like test-driven development. They really went through the cases. I can tell that they looked at all these cases and it made sense, right? And here's proof, literally a PDF proof that they did all that.

Right. So it really, really helps in sales.

Hermes Frangoudis (40:29)
That's huge. Building trust not only within your own customer, but the customer's customer. Exactly. Or potential customer. So we are getting to the top of the hour. Amazing conversation. And I want to be very mindful of your time because I know you got a lot going on. So I appreciate you taking the time to talk to us. I have one more question. It's kind of like our wild card question, a little bit different from whatever we've been talking about. So what's the most surprising or chaotic failure you've ever seen?

 Faraz Siddiqi (40:35)
Exactly.

Hermes Frangoudis (40:59)
in a conversational AI system. Something that's made you rethink the approach to to testing and agents in general.

 Faraz Siddiqi (41:06)
There's one that comes to mind, but I can't talk about it. I'm trying to think about another one. Normally when you're a testing platform and your agent fails, you don't want the testing platform to go onto a podcast and talk about the failure. So that's definitely something that I wanna be mindful of. Trying to think of funny scenarios. I think the most catastrophic one has been the one about latency, right? Where if your average latency in a conversation is over a certain amount, that can mean that your nodes are

you know, absolutely gonna get super, super overburdened, and then they're all gonna go and fail and everything. I remember this even happened to us one time when we were running a load test for one of our clients. We were running way too many calls at once, something that our system didn't scale up to handle properly. So we had to almost scramble when we were alerted by our own alarms, right? That told us that our nodes were going down. We're like, my God. So we went in and we had to go into command center mode and, you know,

just really spin up all these nodes really, really fast. So that was pretty exciting.
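
As a rough illustration of the latency alarm described above, here is a small Python sketch that flags a conversation whose average latency is over a threshold. The function name `latency_alert` and the 1500 ms default are assumptions made up for this example; the conversation does not give an actual number or describe how Blue Jay's alarms are implemented.

```python
from statistics import mean


def latency_alert(turn_latencies_ms, threshold_ms=1500):
    """Return an alert message if average latency is over the threshold, else None."""
    if not turn_latencies_ms:
        return None  # no turns measured, nothing to alert on
    avg = mean(turn_latencies_ms)
    if avg > threshold_ms:
        return f"average latency {avg:.0f} ms exceeds {threshold_ms:.0f} ms"
    return None


# Placeholder per-turn latencies for two conversations, in milliseconds.
print(latency_alert([800, 900]))
print(latency_alert([1000, 3000]))
```

In practice an alert like this would probably feed an autoscaling or paging system, so that nodes are added before they become overburdened rather than after.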

Hermes Frangoudis (42:08)
It's one of those things as a founder, you're like, It's good and bad. It's good that you're getting all that business, but it's bad because it's like, man, I gotta make sure I keep up with it.

 Faraz Siddiqi (42:12)
Yeah.

Yeah. And to that point, there's a lot of crazy stories as a founder. I mean, just this weekend, right? So along with our one-year anniversary, this is probably gonna get uploaded after the fact, but along with our one-year anniversary, one thing we're gonna be doing tomorrow is releasing our new brand. So this is actually the first time that I'm publicly talking about it. So it's pretty exciting stuff. Blue Jay is getting a new logo, a new kind of typeface. The mascot is evolving, right? And

as part of this release, and I feel kind of comfortable talking about it now because it's gonna be posted afterward, we're releasing something called the Blue Jay world, right? Where you can literally dive into the world of Blue Jay. It's these really, really cool, you know, structures and things and a lot of visual elements, and that'll then guide kind of the future of Blue Jay from a UX perspective, right? And UI perspective. And one of the core things there is you need to dive into the world of Blue Jay. So what Rohan and I did is we donned our classic Blue Jay suits, right?

And we literally jumped out of a plane yesterday. We just literally went skydiving in our blue jay suits as part of a promotional thing of like, hey, you know, we're diving into the world of Blue Jay, and you should too. So there's gonna be a pretty nifty video coming out soon. That's gonna be pretty hilarious. Two dudes in a suit just trying to fly, setting a world record for altitude for blue jays. So yeah, it's pretty exciting stuff you get to do when you're building in this space.

Hermes Frangoudis (43:43)
Amazing. I can't wait to see the video. Thank you so much, Faraz, for your time, and I want to thank our audience for following along and listening along. Do that social thing. Like, retweet, follow, and listen out for the next one.

 Faraz Siddiqi (43:45)
Yeah, it'll be pretty funny.

Thanks for meeting. Nice to meet you again, Hermes.