Rishi Ahluwalia (00:00.034)
You are now one of the most well-integrated TTS systems in Agora's conversational AI engine.

Ankur Edkie (00:06.944)
If in a lab setting we have proven that it works and it can easily pass the test, why does it not actually pass the test when it comes to real-world use cases? The final golden solution that everyone's hoping for is again a speech-to-speech system where you have everything solved. With this efficiency, we can provide the lowest costs on TTS across the board globally today.


Rishi Ahluwalia (00:35.81)
Welcome to the Convo AI World Podcast. In this podcast, we interview builders, developers, and teams that are building solutions for the conversational and voice AI space. Today's episode is really special.

We've been joined by Ankur Edkie, the CEO of Murf AI. Together we'll be diving into a topic that doesn't get talked about enough: the hidden complexities of enterprise voice, from scaling voice AI systems to making them feel truly human. Ankur, it's great to have you with us. Welcome to the podcast. Great, great, great. And Ankur, let's begin with the different sections that I've planned for you today.

Ankur Edkie (01:08.91)
Thanks, Rishi. The pleasure's mine.

Rishi Ahluwalia (01:18.742)
I think it mostly starts with your journey. Ankur, you've had a long journey, going from model development to building Murf AI as a voice AI generation platform. So I'm very curious about what pulled you into this space.

Ankur Edkie (01:34.766)
Yeah, we started almost five and a half years back with Murf, and that's a time that predates the ChatGPT launch. So it predates the AI hype, and to that extent we were early movers in this space. At that point in time, if you look at the state of voice AI and speech synthesis overall, honestly, it was disheartening. Text-to-speech was stuck in the space of being used for just IVRs and automated bots. People weren't even trying to pretend to be humans at that point in time. And of course there was no use case where it would be used for anything other than IVRs. So it was a huge, huge gap that I saw between what could be and what existed in the world. And I was very curious. There was just the pace of research back then, the way it was moving, and I felt like voice was actually being left behind. A lot of research was going on in text.

Even in images, there were a bunch of models that had come out that were interesting. No one had yet somehow picked up voice, and I was really curious to take that up and solve this challenge of making voice accessible, usable to begin with, and fast, of course.

Rishi Ahluwalia (02:45.17)
Absolutely, Ankur. I think the pre-ChatGPT era must have been very, very tough. It probably made things easier after that, but it's still an evolving process. So thank you so much for your thoughts on that. And one specific thing that the users will also want to hear from you is what specific problem you became obsessed with solving. Obviously we understand Murf's portfolio, but I think there has to be something more to it.

Ankur Edkie (03:14.894)
Of course. I mean, it's all about the Turing test. Over the past one and a half years, I think everyone's convinced that we as humanity have crossed that bridge, which decades back was unthinkable. You would have a conversation and a majority of people wouldn't know that it's actually not a human. The latest benchmarks are at like 75% of people who can't tell, and that's not just the voice layer but the entire conversation itself feeling meaningful. And that was really our mission from day one: basically make voice such that you pass that test, and you are in a state where it's actually something that would add to what humans could do at scale. Voices were always stuck in terms of scalability, and not just scale with your servers and the scale at which we are doing operations of customer support and whatnot with voice bots, but actually scaling content creation, the ability to use it in places where bots had no chance of actually succeeding. And that was the main motivation. My first goal when I started the company was to actually see it played in an advertisement, because that's 30 seconds. You still have wiggle room; it's not a one-hour movie. So that was my first goal: can we recreate some of the iconic ads with AI and get humans to tell us whether they can actually figure it out? Does it do justice to the script and the tonality? Does it actually pass for a human, even to people who are experienced in the field, not just laymen?

And that was what we started with as our goal when we began this company. Of course, over time, with the evolution of AI actually speeding up so much, the use cases evolved as well, along with finding product-market fit. As that evolved, we built several products. As you know, in our portfolio, starting from creators, now we're focused entirely on conversational AI as well, which seems like the next biggest unlock that the world can have in the voice space. So we're very excited about that now.

Rishi Ahluwalia (05:10.646)
No, indeed, Ankur. I think everyone's excited about conversational AI and the value adds that Murf is offering and the industry is offering. I truly believe that what you're offering right now might change every month or every few months, and the new things would be something that everyone would appreciate. Right. So coming to my next question, Ankur: at what point did you realize that voice AI is much harder than it actually looks?

Ankur Edkie (05:40.75)
Well, it happens with every new product that you create. This is the third big product launch that we've done. Honestly, we've had that revelation in every one of them. The first one was with creators. As I mentioned, one of our goals was to be able to create an advertisement which would actually pass the bar. And we were pretty happy with what we had produced as an output. And then we showed it to a professional director who had done this professionally and sought his feedback. We were amazed at what kind of subtle insights humans still actually hold on voice. We as technologists didn't actually first think about some of those things. And that director actually gave us a full recording of how he would do that take and where AI actually differs from how humans perceive some of those things. That was my first foray into where the gold standard is. And again, that translated to the next product as well. We built something for dubbing and localization, and the nuances of it. So every time we've gone from a demo to actually taking it to enterprises and to actual industry experts, there's always been a learning curve.

Rishi Ahluwalia (06:46.456)
That's an interesting take, right? Because these professional directors definitely have a much bigger body of data than we can probably imagine, because their human interaction is varied across so many different professionals that they deal with. That's a great take on this, thank you so much for that. My next question is something that everyone is still contemplating: what do you believe is fundamentally broken in today's voice stack?

With so much evolution already taking place, and with adoption going from POCs to actual deployments, what do you think is still broken?

Ankur Edkie (07:24.974)
I think the promise is well understood by everyone already. The promise of the Turing test, the ability to easily mimic a human to a certain degree of confidence, all of that's been proven, I mean, proven pretty scientifically. So there is no doubt that the technology exists. But if you look at, as you said, enterprise adoption, where people are, and how the demos actually work in production, or even if you've had a conversation with a voice bot in reality while perhaps booking a reservation somewhere, you won't say it's passing the Turing test. You would know it's a bot, and you might probably be forgiving because it's a bot. But that's an interesting question. If in a lab setting we have proven that it works and it can easily pass the test, why does it not actually pass the test when it comes to real-world use cases? The problem space is wide. The gaps are in several spots in the whole pipeline, of course. The final golden solution that everyone's hoping for is again a speech-to-speech system where you have everything solved in one nice little bow, and you have perfect tool calling and whatnot, where the bot can do a lot more than just have a conversation. That's of course the place where we think the industry is heading, and in the next couple of years you would see some breakthroughs where it will actually be a reality for actual enterprise adoption. But for the next two to three years, we still think the current stack of ASR, LLM, and TTS is still the most viable option.

And we're going to continue to try making it as close to perfect as possible. I feel one of the reasons why there's a gap in the industry is still the fact that it's independent entities, often separate companies, sometimes vendors and providers, who aren't actually interacting meaningfully with each other. You have three independent stacks. There is hardly any passing on of information across the board. You're just relying on text. There's a lot more: even as a text-to-speech provider, I can benefit if I knew what the customer said. In a hop-based voice bot system, as a text-to-speech provider, I'm just getting a text. I don't even know what I'm responding to. And that's where most voice bot architectures today are sitting. There's a huge upside in owning the entire system, having the information passed along, and having models capable of using some of these insights while responding at each layer. So that gap is shortly going to be filled.
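The point about a TTS stage that only ever receives text can be made concrete with a small sketch. Everything here is hypothetical: `TurnContext`, `build_tts_request`, the `user_sentiment` label, and the style names are invented for illustration and do not describe any vendor's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnContext:
    """Hypothetical metadata a cascaded pipeline could forward between stages."""
    user_transcript: str                    # what the ASR stage heard
    user_sentiment: Optional[str] = None    # assumed upstream label, e.g. "frustrated"


def build_tts_request(reply_text: str, context: Optional[TurnContext] = None) -> dict:
    """Build a request for an imaginary TTS stage.

    Without context the TTS stage sees only text, as in most cascaded stacks.
    With context it can at least pick a speaking style for the reply.
    """
    request = {"text": reply_text, "style": "neutral"}
    if context is not None and context.user_sentiment == "frustrated":
        request["style"] = "calm"           # illustrative rule, not a real product feature
    return request
```

In the text-only case the request always carries the neutral style; once the upstream stages forward what the customer said and how, the same reply can be rendered differently.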

Rishi Ahluwalia (09:43.394)
That's correct. That's correct. I think every model or every system probably has its pros and cons, right? There are many pros to the cascading model as well, while there are cons too. Thank you for being honest here; I think that's what makes it very different. All right. So let's talk about perception versus reality. There is still an illusion of solved voice AI. So why do voice demos feel magical but production systems feel frustrating? I think you've already answered that in the last question, but I think scaling is still something that the industry has not completely envisioned, or the stack behind it is not completely figured out. So why do you think voice demos still feel so magical while production systems do not perform as well as the voice demo? That is evolving, that is getting better, but we'd like your opinion on the same.

Ankur Edkie (10:39.47)
There are definitely several factors to it. Assuming all the physical factors have been kept the same. First, most demos are conducted on laptops and whatnot. You just have much better audio going on, and you're not actually testing on 8 kHz bandwidth telephony, and you're not actually doing those tests in real-world scenarios. Even barring the brains of the system, assuming they're the same, you have the same LLM and the stack is the same, just the acoustics itself matters so much. Just the device on which you're playing things matters so much. And then there's just the variability that exists in the real world, right? I'm not even getting into the nitty-gritty of how the LLMs are behaving and the language nuances of whether LLMs perform equally well on each language. Of course, there's a whole spectrum of where LLMs also don't do justice. And with every prompt, of course, the length of the prompt degrades systems. There is a lot of research on what the optimal context length I should be using is. If you provide a million-token context, should I be using the entire thing? Will I still get the most out of it? So there's a whole study of that, of course, suggesting that 50% context usage is perhaps the best sweet spot that you can attain. And barring all of that, if you just think about the experience on the speech layer, just the diversity of devices itself is going to break your production systems, right? You don't know what device people are actually playing your audio on. You don't know the acoustics of the room they're playing it in. Your ASR depends on whether it's recording from your car or whether it's recording in a quiet room, or whether the room dimensions are very small, so you have huge booming acoustics going on there. And that variability, of course, in a demo you're not going to be testing all that, right? You're really in a controlled environment. So in a very optimal setup, of course, it beats all the Turing tests of the world. But even a simple jitter cascades: you get a few words wrong in ASR, your LLMs have to make up for it, and TTS can't do much about it anyway. So yeah, I mean, it's just the reality of how complex the real world is.

Rishi Ahluwalia (12:36.918)
Yeah, I think that's absolutely right. So I think with the advancements in technologies and the new things that are still coming in, the fundamentals and dynamics remain the same. The complexity of the use cases and the complexity of how you're using your systems remain the same. It actually just widens the capabilities, but I think sticking to the dynamics and fundamentals still makes a lot of difference. Absolutely. Thank you so much for that.

And I think one topic that everyone's still kind of obsessed with is latency. And that will continue to be discussed no matter at what stage we have reached in conversational AI or the voice AI space. So is lower latency always better, Ankur, or is there an optimal human threshold?

Ankur Edkie (13:22.638)
I mean, if you stretch it to a hyperbole, of course, you don't want to respond within a hundred milliseconds with all your hops. But in reality, we're nowhere close to that threshold as a voice system when you have these hops. There have actually been tests on this; there is documented publication. A sub-hundred-millisecond response time to a question is a very obvious giveaway: "you didn't even listen to what I'm saying" is what the most obvious feedback would be. But in reality we're very far from that threshold. Having said that, every single customer that we have interacted with, enterprise or even non-enterprise, always wants all their systems to be really low latency so that they can give most of their bandwidth to LLMs for increasing the reasoning ability. That's just obvious. You have a budget: get responses within 500 milliseconds tops. And you want to give most of the time to your reasoning layer so that you don't waste time. You want to close the conversation in a time-bound way, right? You want to stick to the agenda, and there's lots more that we need to do within the logical layer. So I don't see that demand for low latency going away anytime soon. Rather, it's just increasing as more complex voice bots get built. You want your budget to be given to LLMs.
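The budgeting argument is simple arithmetic, sketched below. The 500 ms total comes from the conversation; the per-stage numbers and stage names are assumptions picked for illustration, not measurements of any real system.

```python
def llm_budget_ms(total_budget_ms, other_stages_ms):
    """Return the time left for the LLM after every other hop is paid for."""
    remaining = total_budget_ms - sum(other_stages_ms.values())
    if remaining < 0:
        raise ValueError("non-LLM stages already exceed the total budget")
    return remaining


# Assumed per-stage latencies in milliseconds, for illustration only.
stages = {"network": 60, "asr": 120, "turn_detection": 20, "tts_first_audio": 130}
print(llm_budget_ms(500, stages))  # 170
```

Every millisecond shaved off a non-LLM stage goes straight back to the reasoning layer, which is why the demand for low latency in the other stages keeps growing.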

Rishi Ahluwalia (14:46.526)
Absolutely. And I think that's where your latency comes into the picture. The sub-130-millisecond time to first audio offered by Murf is, I think, a game changer for the industry as well. And coming to that, you'll be able to shed more light on this: what matters more? Is it time to first audio, continuity, or jitter? Or again, is it still dynamic in nature?

Ankur Edkie (15:12.792)
Of course, all three matter. Some of it is bread and butter. You can't have a system without having continuity, right? Some things are just obviously necessary. Time to first byte is where most systems in production today struggle, so that's what is spoken about a lot more. Then jittery audio: I think there are different kinds of jitter. If it's actually TTS system jitter, which is rare, I think it's a deal breaker, of course. Network jitter is of course something that humans are accustomed to. Even if you're talking to a person, you know that there's going to be some kind of disruption in what the person is saying. It's not as big a deal breaker in my head, at least from a telephony system perspective. People are aware that humans also face this. But yeah, expectations from an AI system are generally higher than from humans. This is also something that gets overlooked a lot. If you get people to evaluate humans and score them on some of these things versus evaluating AI, you would be stricter with AI. It is expected to work.

Rishi Ahluwalia (16:13.242)
True, Ankur. I'll tell you a quick story right now. I've been interacting with a customer, and what they told me is that with the combination of Agora and the other cascading things that we have in the engine, especially with Murf as well, the quality of sound sometimes feels like studio quality, right? And that is not something expected for every use case that we are running. So they actually wanted to insert noise, while we're talking about noise cancellation and latency.

You mentioned network problems as well. Sometimes people want that slight noise, slight jitter to actually be in the call so it feels more natural, because sometimes it can be studio quality that we are actually emitting. So sometimes quality can be a problem as well, when people want to have more of the naturalness of the environment in the picture. So absolutely, I think it's probably a balance of everything that customers are looking for today.

Perfect. Thank you so much. So again, one more thing. Since you're talking about the cascading thing, Ankur: RTC, ASR, LLM, TTS, this is the usual stack. And in the usual stack, sometimes the real bottleneck can be the turn-taking process. One thing that's underestimated is obviously turn-taking. So why is knowing when to speak harder than generating speech?

Ankur Edkie (17:32.622)
Actually, I mean, as a science problem, I wouldn't agree that it's harder. I think it's actually understudied. I would put it in that bracket of things, but it's a new problem. We didn't have the use case of building voice bots before LLMs could produce meaningful outcomes, before TTS could sound natural enough. So it's a problem that's actually come into existence in the past two years. The TTS problem has been there for 30 years, 40 years, all the way. So it's just that the infrastructure, the datasets you need to train on, and all those sort of behind-the-scenes things are missing for this piece. The only thing, I think, which makes it a little bit harder is that most providers would want to do it on a CPU stack rather than invest in an accelerator. You probably don't want to spend on GPU for this specific problem statement because it's within the RTC layer most of the time, which is not using accelerated compute.

That makes it a little harder, and you have a time budget of under 20 milliseconds most of the time that you want to actually allocate to this task. So from that perspective, it becomes harder, because as humans, we figure out turn-taking with a lot of foresight. We have a history of having those conversations. There's a reasoning model going on in our heads on whether this person actually wants to speak more or not. It's not just looking at the past three, four milliseconds of frames, which most current models on turn-taking do. They are looking at just the last few seconds. And that's just pattern matching in terms of how your pitch contour looks. Those are not enough. There's a lot more when humans think about whether people are going to say more. Even if you think about stuff like an address, right? I'm going to say the first line. As a human, you know that's not a complete address, right? And your turn-taking machines are not as capable as an LLM to figure those things out. It's a simple model too, because you have very limited CPU and very limited time. So the complexity of the problem is just much harder, and one of the ways to think about it is also to try to delegate it to LLMs. There are some ways, which are still being explored, on how we do this better with the LLM itself rather than trying to have micro models for each of these tasks.
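To see why looking only at recent audio frames falls short, consider a deliberately naive endpointing rule, sketched below. This is a toy baseline under stated assumptions (per-frame energy values, a fixed silence threshold, a 20 ms frame size), not the method used by any system mentioned in the conversation.

```python
def is_end_of_turn(frame_energies, silence_threshold=0.01, min_silent_frames=25):
    """Naive endpointing: declare end of turn after a run of quiet trailing frames.

    frame_energies: per-frame energy values, oldest first (e.g. one per 20 ms frame).
    silence_threshold and min_silent_frames are assumed tuning values.
    """
    if len(frame_energies) < min_silent_frames:
        return False
    recent = frame_energies[-min_silent_frames:]
    return all(energy < silence_threshold for energy in recent)
```

A rule like this is cheap enough for a CPU and a tight time budget, but a caller who pauses for half a second after the first line of an address triggers it just as surely as one who has finished speaking. Telling those two cases apart takes knowledge of what was said, which is the argument for delegating part of the decision to the LLM.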

Rishi Ahluwalia (19:41.582)
No, that's absolutely right. I think "understudied" is where the problem exists as of now. So hopefully there will be better solutions available. And Ankur, we've been talking about the LLM as the reasoning layer, the intelligence layer in the stack, right? But do you think there is a way that TTS can also be designed for incomplete thoughts and overlapping speech?

Ankur Edkie (20:05.056)
Incomplete thoughts and overlapping speech. So overlapping with the customer?

Rishi Ahluwalia (20:13.844)
Yeah, absolutely. Actually it can be both ways, right? Let's say our customer spoke something and immediately it got overlapped by the TTS engine. Again, I'm not saying it's an issue only with the TTS engine. It can be the ASR layer as well. It can be the interruption handling within the stack as well. But are there ways where TTS can be used to avoid such scenarios or maybe identify such scenarios?

Ankur Edkie (20:40.942)
Yeah, I mean, as I said, in the current way these systems are being built, most TTS systems don't get the insight into when the customer is speaking or how the customer said those things. We don't actually get those audio frames at all. There's an element of time to it as well: when something was said is important. We're talking about milliseconds of when something is going to be said. So interrupt handling doesn't get delegated to the speech layer today, at least in this current hopping system. But yes, if we merge some of these into one, you could easily have a system which is just ASR and LLM combined, for example. You're looking at speech as an input and text as an output. So you close some of those gaps, where you know whether or not you should output a text. So I would still say it's something that gets solved a little further up in the chain rather than going all the way to TTS to respond or not. Interruption is where, if a signal is provided to TTS, yes, some of these systems can do a little better job of getting interrupted. To your point, the TTS system is being interrupted in this case. Currently, it's an abrupt pause, which is you just flush whatever you have as bytes, which could be in the middle of a word. Humans, most of the time, if they were speaking instead of a bot, would likely at least finish their word, if not the entire thought, when being interrupted. There is some bit of intelligence in how we face interruptions. That intelligence, of course, isn't there yet. Whether it's the most important thing to be solved and whether people want that is still to be seen. But I would still argue most of the problem statement gets solved upfront, at the beginning of the chain.
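The "finish the word instead of flushing mid-word" idea can be sketched as follows. It assumes the TTS stage exposes word-level end timestamps for the audio it has synthesized, which is an assumption for illustration; `graceful_cut_ms` is a hypothetical helper, not part of any product discussed here.

```python
from bisect import bisect_left


def graceful_cut_ms(word_end_times_ms, playback_position_ms):
    """Return the time at which to stop playback so the current word finishes.

    word_end_times_ms: sorted end time of each synthesized word, in milliseconds.
    playback_position_ms: where playback is when the interruption signal arrives.
    """
    index = bisect_left(word_end_times_ms, playback_position_ms)
    if index == len(word_end_times_ms):
        return playback_position_ms     # already past the last word, stop now
    return word_end_times_ms[index]     # play on until the current word ends
```

For example, with word ends at 300, 650, 1000, and 1400 ms, an interruption at 700 ms stops playback at 1000 ms instead of cutting the third word in half. The same lookup could target sentence ends to let the bot finish a thought, at the cost of a longer overlap with the customer.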

Rishi Ahluwalia (22:16.566)
Absolutely. I think that's fair. That's fair. And the next topic is, I think, very close to your heart, Ankur. Let's talk about the Falcon architecture itself for the Murf TTS. I think there is a specific term called compute efficiency. So what does that actually mean in the model design?

Ankur Edkie (22:39.138)
Yes, we had to almost coin that word, for TTS at least. People don't speak about it in terms of efficiency. What we really mean there, simply put, is minutes of synthesis per dollar. And I want to put it as per dollar rather than per teraflop or something like that, because it's not in terms of compute but actually the cost of compute. And that's important because for us as service providers, and for consumers and users of APIs, there's just a lot that matters to each of us in terms of what the compute cost of this synthesis is. There are two factors to it. The first is whether the synthesis can be done at all on a smaller GPU or whether it always needs a much larger GPU; that's a defining factor of what it can use at all. The second factor is, when you do use it on a smaller or commodity GPU, what kind of concurrency and throughput can you actually achieve on it? Both of these factors are important when we say it's the most efficient. How it plays out in the real world, or why it really matters, is multi-fold. Of course, with this efficiency, we can provide the lowest costs on TTS across the board globally today. And that's not cutting into our margins; that efficiency is what gives us that cost. The second big thing that we can achieve with this is data residency. If you look at the data residency that any of our competitors provide, we can easily beat them. We do beat them globally. We provided 11 geographies right from the day we launched the system itself. And again, that's only possible because of this efficiency: you're not going to have the same efficiency in every geography if you don't have the same scale to operate at. You want your systems to still be up and running and be able to handle scale when you don't yet have that kind of demand gen in each region. Because of commodity GPUs, because of the ability to scale up and down quickly, we are able to achieve that. The third parameter that brings in is on-premise. Unless you're able to host on a commodity GPU, your customers are not going to be hosting on H100s. They are not going to have a similar scale of TTS requirement such that they would host it on such large machines. Again, most of our competitors don't actually succeed at on-premise, where the end customer actually sees their costs drop rather than go up when they go on-premise compared to using an API. So all of these can only be done when you really are efficient.
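"Minutes of synthesis per dollar" can be written down as a one-line formula. The sketch below is one plausible way to compute it from the two factors mentioned (which GPU is needed, and how much concurrency it sustains); the parameter names and all example numbers are assumptions, not published figures.

```python
def synthesis_minutes_per_dollar(concurrent_streams, realtime_factor, gpu_cost_per_hour):
    """Minutes of audio synthesized per dollar of GPU time.

    concurrent_streams: streams one GPU serves at once (assumed measurement).
    realtime_factor: seconds of audio produced per second of wall time, per stream.
    gpu_cost_per_hour: hourly price of the GPU in dollars.
    """
    audio_minutes_per_hour = concurrent_streams * realtime_factor * 60
    return audio_minutes_per_hour / gpu_cost_per_hour


# Hypothetical comparison: a commodity GPU versus a much larger one.
commodity = synthesis_minutes_per_dollar(40, 1.0, 1.00)   # 2400.0
large = synthesis_minutes_per_dollar(100, 1.0, 4.00)      # 1500.0
```

With these made-up numbers the larger GPU serves more streams but yields fewer minutes per dollar, which is the sense in which a model that fits on commodity hardware can be the more efficient one. Utilization matters as well: a large machine in a low-demand region sits mostly idle, which lowers the effective figure further.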

Rishi Ahluwalia (24:53.742)
Absolutely. I think all these points collectively solve that problem, right? Data residency is anyway becoming very, very popular and very much mandatory now. Even with India's DPDP Act, things are going to get stricter in the region as well. And with the geopolitics at the moment, things are going to get stricter still. So I think you're already well equipped to support customers in that space. The other thing, Ankur, is that we have talked about the cascading flow and how important the LLM layer is.

But there are LLM providers who also provide TTS, right? Usually they are not able to work as well as real-time TTS systems such as yours. So where is the gap in those situations? Since they own the intelligence layer, people would think that they can provide other bits of this stack as well, but they are not able to crack that yet.

Ankur Edkie (25:46.572)
So there are two solutions, of course. You have a pure-play text-to-speech system, and then you have a real-time LLM itself which can generate speech audio tokens right off the bat. If you were to solve this as a single system, you are hit with several challenges, right? If you are an audio-native LLM, you're basically thinking in audio tokens. You're not actually thinking in terms of speech or text tokens. And that just scales up the model size and the requirements of the model. So you are right away exploding a model for stuff that could have been easily broken up. You want your acoustic system and your semantic system to be different because they work at a different scale of token size entirely. All the research in the past year has been to actually shrink audio tokens as much as possible to bring them somewhere closer to text tokens. There are a lot of tokens that you need to represent a second of audio. Compare that with how many tokens you need to store just the transcribed text, which is perhaps just a sentence: it's orders of magnitude higher right now.

Audio-native systems find it really hard to hit the same latencies. Of course, if they were to build a text-to-speech system and actually care about conversational AI and a text-to-speech-only system, maybe they can get there as well. Yeah. Their goals are AGI, so they're probably not going to...

Rishi Ahluwalia (26:56.59)
Go there. Perfect. I think, yeah, those are the kinds of complexities that are solved with different stacks. Right. Now talking about the full-stack reality: recently, Agora and Murf got into this partnership, right? So you're now one of the most well-integrated TTS systems in Agora's conversational AI engine. But let's zoom out to the full stack. Why is TTS alone insufficient for enterprise voice?

Ankur Edkie (27:27.316)
Of course. I mean, in a voice agent setup, there's lots going on, right from the beginning. And as I said at the beginning of our conversation, the effect of cascading errors is dramatic. An error made by an ASR system just completely throws everything off. An error made by the LLM system is slightly less of a problem. An error made by a TTS system is slightly less of a problem compared to that. It's the difference between making a word-level error, a concept-level error, and an error at a completely different topic level. So that's where the entire stack really has to play well, and they have to play well together. Where TTS systems really need to excel, though, is the ability to be consistent in their latencies, in their performance, in their accuracy and pronunciation, and to be a very reliable system. That's what we tried to achieve with Falcon. It's a little bit like Uber wait times, right? If you have an Uber right next to you, do you actually get that one when you book it? You don't. They actually try to manage the experience so that everyone gets about a five-minute wait. That's similar to TTS. You don't want a conversation where some responses come in 50 milliseconds and some in 200 milliseconds. Even if everything just comes in at a steady 150, you'll have a much better experience in the conversation. And we measure that all the time. The coefficient of variation that we measure in our latencies is the lowest in the industry at 0.17.

And it's very important to have consistency in how we speak to sound more human, not where a few requests are really fast and a few are really slow. So individual players have their own strengths, but the stack, of course, has to really play well together.
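The consistency metric mentioned here, the coefficient of variation, is the standard deviation of the latencies divided by their mean. A minimal sketch follows; it uses the population standard deviation, which is an assumption, since the conversation does not say how the quoted figure is computed, and the latency samples are invented.

```python
from statistics import mean, pstdev


def coefficient_of_variation(latencies_ms):
    """Standard deviation divided by mean; lower means more consistent latency."""
    return pstdev(latencies_ms) / mean(latencies_ms)


# Invented samples matching the 50 ms / 200 ms versus steady 150 ms example.
erratic = [50, 200, 50, 200]
steady = [150, 150, 150, 150]
```

The erratic samples have a lower mean (125 ms) than the steady ones but a coefficient of variation of 0.6 against 0.0, which captures the claim that a predictable 150 ms feels better in conversation than responses swinging between 50 and 200 ms.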

Rishi Ahluwalia (29:00.812)
That's absolutely right. Thanks for emphasizing the transport layer and the ASR layer, right? Because that's where the maximum errors can happen. And if customers have the luxury of having a single provider for the real-time transport layer and ASR, like what Agora offers, and then a reliable TTS system like Murf, we can actually work with any LLM out there for reasoning on complex tasks. And that's how the entire stack actually complements each other.

And then we can work across various use cases. Perfect, Ankur. I think that's a great perspective. So now shifting to the hidden enterprise constraints, Ankur, which you touched upon when you spoke about data residency: what do enterprises care about that builders often ignore? I think builders are more focused on the use case, the complexities, the servers, the infrastructure, all those things.

But there are several other things, I think, from the security standpoint. So think about compliance, reliability at scale, and finance caring about predictable cost when you think about enterprise, right? How do you think these things are going to change, or how do you think we can complement these things together?

Ankur Edkie (30:15.148)
Yeah, I mean, reliability is the name of the game, of course, in enterprises, even more so than naturalness, right? Actually, if you go use case by use case, there is this whole debate on how much naturalness plays a role. How much should a bot actually try to hide the fact that it's not a human? How do customers behave when they figure out that it's not a human upfront versus not being told upfront? So all of that actually boils down to the topic of trust, right? Your bot has to build trust with your enterprise's customer.

So it comes with how you are responding. It comes with the voice you're using, the expressiveness, whether you're listening to the customer. There's lots that comes along with the expectation of trust. It's a lot more than how people think about some of the systems in isolation. If you're a builder and you're picking your vendors and you're looking at a TTS system in isolation and say, wow, this sounds like the most natural voice that I've ever heard in my life, and that's why I should go with something like that, that's not going to pan out. That's not enough. That's not nearly enough to build a system which has that trust.

The right way to think about it is still like, if you do a hundred calls, you collect the feedback from the hundred customers that you've spoken to, and check whether the outcomes that the enterprise expected were achieved or not. That's the gold standard, right? You're not going to succeed by evaluating individual systems in isolation and assuming they'll work together well enough for the entire system to work.

So an end-to-end, holistic sort of testing framework is important to succeed. And that's where I think most evaluators, or builders who try to build this, probably fall short. Between how an enterprise would judge the system and how you are actually evaluating it yourself, there might be a big difference.
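[Editor's note: the "hundred calls" evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not anything Murf or Agora ships. The `CallResult` record, its fields, the 1-to-5 rating scale, and the sample data are all hypothetical.]

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    """One completed call, judged end to end (hypothetical record)."""
    call_id: str
    outcome_achieved: bool  # did the call reach the outcome the enterprise expected?
    customer_rating: int    # assumed 1-5 post-call feedback score


def summarize(calls):
    """Aggregate end-to-end results instead of scoring components in isolation."""
    n = len(calls)
    if n == 0:
        raise ValueError("no calls to evaluate")
    return {
        "calls": n,
        "outcome_success_rate": sum(c.outcome_achieved for c in calls) / n,
        "avg_customer_rating": sum(c.customer_rating for c in calls) / n,
    }


# Made-up sample: 100 calls, every fourth one misses its outcome and is rated low.
calls = [
    CallResult(
        call_id=f"call-{i}",
        outcome_achieved=(i % 4 != 0),
        customer_rating=4 if i % 4 != 0 else 2,
    )
    for i in range(100)
]

report = summarize(calls)
print(report)
```

[The point of the sketch is only that the unit being scored is the whole call and its outcome, not the TTS, ASR, or LLM component on its own.]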

Rishi Ahluwalia (32:01.576)
Absolutely. I think that's something that's not always the first priority for builders, right? But for enterprises, it definitely changes. The other thing is still the cost illusion, Ankur, right? So if you talk about enterprises, they don't really have one-year contracts these days, right? They look for multi-year contracts. They try to stick with vendors who are doing that. Although that is also changing with the new technologies that are coming up, right?

So the next thing that is important to enterprises, Ankur, is obviously the cost, but there's still an illusion about this, right? And there are obviously some misunderstandings as well. So let's talk about Murf TTS, for example, at one cent per minute list price. What are people underestimating?

Ankur Edkie (32:48.654)
time and in the past, if you look at the past two years as well, there's definitely a significant drop coming in as things progress. As models get more efficient, as GPUs get cheaper, as scale hits, as more competition builds, you're definitely going to have cost efficiencies coming into the system. So account for that, even if you're doing a multi-year contract or if you're thinking about building the system today, which may not be as efficient as you want it to be.

If you want to be ahead of the game, you need to start investing today in voice systems. Yes, economies will come. They will flow on their own. I don't think it's the right time to worry too much about costs because it's a new system. There are always going to be economies of scale. If you have scale, of course, you already get the best pricing. And even if not, just look at the trend of the industry: over the past two years, GPU costs and everything else have been going down very rapidly, and will continue to. So they're not accounting for the future, I think.

Rishi Ahluwalia (33:46.598)
Absolutely. I think it's going to get more affordable and I think we're going to see more voice adoption in use cases we never even imagined. So that takes us to the next question. We've been talking about quality, hyper-realistic voices, the naturalness that comes in TTS, but the fascinating question is: do users really want hyper-realistic voices? Do you think there are still use cases where simpler voices perform better?

Ankur Edkie (34:16.014)
Of course, there are lots of use cases where each of them has its own play. Hyper-realistic, in fact, is actually less of a use case in enterprise, in my opinion at least. It plays out great as an assistant in a B2C setup. It works. You want people to be able to have an engaging conversation. You're not actually looking for any major goals. You're just looking to chat with your friend. And that's a different storyline, and you want that. And that has its own industry for it.

And perhaps a little bit in education, where it's more engaging, where you want kids to still sort of engage and not get bored. There are a few cases where hyperrealism helps. Of course, that also is only useful if it gets it right. I mean, hyperrealism should not mean that it's just a very variable prosody, where it's doing random stuff just so that it feels like it's not flat. Non-flat is not hyper-realistic, is what I'm trying to say. So hyperrealism has its own scales of what it means to be hyper-realistic.

But on the other end, where most of the industry sits and what enterprises are looking for, and where even the regulations are going in terms of making an automated call, I think you're going to need to announce yourself immediately, that you're actually a bot, right? You're not going to try to fake a human. And that's where the world is going to be. Having said that, when it's already known, what people are looking for is actually, as I said before, trust.

They want a voice which is making sense to them, which is actually being serious about their problems, right? You don't want a bank teller to sound excited about your bank balance, right? You want that person to be professional. So there is an expectation of being professional. And most of the time, professional does mean that you're not being too hyper-realistic. Given where the industry is, what people expect, and the fact that we're going to announce ourselves as not being humans, there's a huge market for not trying to be very, very casual about the speech. Realism still holds value even in a professional setup, but it doesn't mean being casual.

Rishi Ahluwalia (36:19.456)
I think then it's more about personalization than customization. So people still want to relate to the person that they are speaking to, and they want them to obviously respond in a certain way to the problem that they're trying to resolve. Yeah, I think that's an honest take.

Ankur Edkie (36:36.558)
You still want empathy, you still want relatability, you still want to feel that people get you. But yeah, not overly enthusiastic all the time.

Rishi Ahluwalia (36:44.152)
Yeah, absolutely. Great. So thank you for those insightful thoughts there, Ankur. Now let's look ahead and talk about the future, which is obvious in one manner but non-obvious in another. So will voice become invisible infrastructure going forward, with the kind of acceleration that we are seeing right now?

Ankur Edkie (37:10.004)
I do think so. It happens with every technology. I mean, people don't say anymore that "we are on cloud," as it's obvious. You don't think about electricity, right? It's assumed that it'll work. I'll turn the switch on, it'll work. So that arc of technology is always there. It's not just even about voice. It's going to expand to voice bots. It's going to expand to even LLMs, right? You would assume that of course it's a voice bot. I don't think that's very unique to voice.

Whether voice itself will get there even sooner than other things, probably yes, because of the maturity the industry has reached. Maybe it will reach there faster. But yes, this is how technology works, I guess.

Rishi Ahluwalia (37:48.642)
No, absolutely. I think, and I'm not trying to be too dramatic on the next question here, Ankur, but with everything that's going on with the acceleration in technology, with things changing in weeks rather than in the months that we were used to before, right? So what's one shift coming that most people might not be ready for?

Ankur Edkie (38:11.406)
I think one of the things is people are probably not able to assess the pace at which things will change. I mean, people like us who are in the technology business, who are working with these systems day to day, know where the tech is today and where it's going perhaps, but the vast majority doesn't. Just for an example, think about voice agents. If you think about where voice agents are being deployed, it's either an outbound call, where you're probably trying to do sales, or an inbound one, where you're trying to do customer support, right?

Where I see this going, you are going to actually have proactive outbound calls, right? Imagine it's an AI advisor who's actually making friendly calls to you, just as a customer success representative would, right? Just like people think about LLMs as being the next big wave of replacing a lot of jobs and taking up a lot of simpler tasks, it's somewhat similar with voice agents as well, where voice agents can do a lot more of client relationships going forward, right?

You don't actually need every conversation to be done with humans anymore, right? And then if you extrapolate it slightly further, as a customer, how often do I need to still keep receiving calls from voice bots, right? I might want to have my own agent. So at some point I would have a personal agent who probably talks to my personal advisor agents, right? So agents are probably talking to agents at some point. And that's not as absurd as it feels to begin with.

If you extrapolate it further, perhaps they'll drop the voice layer. If bots are talking to bots, why do they need voice? So you could of course keep extrapolating it. How quickly that future comes in, we'll see. But I think one-sided conversations being replaced mostly with bots across the board, with any company you're interacting with, seems very near, where you have your advisors calling you up. Let's say you want updates on your portfolio with a financial advisor. You get that at any point in time, right?

So that future is very near and most people are really, I think, not thinking about those things. At least the common man isn't.

Rishi Ahluwalia (40:14.816)
Absolutely. And as someone who's following the stock market right now, when you were responding to that question, right, that's one thing that lit up my thoughts as well. Proactive, you know what, give me a signal where obviously I can do wealth building, I can make money and all of that stuff. Maybe life experiences as well. So yeah, I think that's something that people would definitely look forward to. Interesting take there. Thank you.

Ankur Edkie (40:44.173)
Yeah.

Rishi Ahluwalia (40:44.646)
All right. So I think, Ankur, we're almost at the end of the podcast right now. To wrap things up, what's one hard lesson about enterprise voice that every builder should know from day one, especially coming from you, right? Since the problem that you were obsessed with was pre-LLMs, right? So what did you think then and what do you think with the augmentation of LLM, which was going on in the GPU side of things?

Ankur Edkie (41:13.058)
I mean, there's this whole enterprise playbook that of course has almost transferred from the SaaS world into the AI world: what it takes to succeed with enterprises, how they operate, and how they buy software. Some of that has carried forward, but a lot of it has not carried forward as well, like the way enterprises make those choices and how they evaluate AI systems. It's just so fast-paced and so fast-moving that they aren't willing to make some of those commitments upfront the way they were with SaaS.

With SaaS, you know it works. When you take a demo, you try it out for a day, you know that it'll solve your purpose. With AI, there are a bunch of risks that enterprises are faced with today. They firstly don't know how to evaluate the systems. They don't know if the technology is going to be upended within a quarter or two quarters. How do I make long-term bets? And enterprises have always been in the bucket of taking long-term bets. They don't want to be buying a monthly software. So that's a paradigm shift.

And I think as builders, we need to be able to solve this for them. We have to basically be speaking their language, bring them closer to the reality, and showcase how they should be thinking about the systems, right? And the pace at which things change, and show them, if things do change, how we are going to react as vendors for them, for the customers that we are being entrusted with. So I think we have to, as builders, observe that mindset shift between SaaS selling and AI selling.

Within voice specifically, I think it's all about experience, the overall end-to-end experience and customer outcomes. I think enterprises still think about it as one single holistic voice agent system rather than as text-to-speech or speech-to-text or buying individual APIs. They are looking for outcomes. So outcome-first pricing, outcome-first conversations, outcome-first product building from the ground up. Those are important things, which was true for SaaS as well, but more so for AI.

Rishi Ahluwalia (43:07.982)
I think it's great that you mentioned outcome-based everything, actually, because in the SaaS world it was mentioned, but not really completely adopted. But I think with AI, the bets are only on outcome rather than how a product performs or what its capability is. If it serves the outcome, it serves the purpose and everybody appreciates it. So great, Ankur, thank you so much for all your valuable thoughts and insights for today's edition of Agora's Convo AI World podcast. It was an amazing time. And thank you so much again for your time and for all the insights.

Ankur Edkie (43:46.04)
Thanks for having me, Rishi. It was fun.