Hermes Frangoudis (00:06) Hey everyone, welcome to the Convo AI World Podcast, where we interview the builders and founders pushing the voice AI space forward. Today we have a very special episode for you. We're focusing on speech to text, and we're speaking to the researchers, developers and founders that are revolutionizing the way machines understand audio and translate it into text that is usable by large language models and AI systems. Let's dive in. What is the first piece of it, right? Like the speech to text, what's happening there? What is ASR, and like noise cancellation, voice activity, the VAD? Let's dive into a little bit of that. So the three main components of a conversational voice pipeline are the speech to text, or sometimes called ASR — they're sort of interchangeable, but they mean slightly different things — and that's where the computer gets to understand what you're saying, and that gets turned into text. And the second part is then being able to take that text and come back with an answer. We've all played with ChatGPT online. You know what it's like. You send it some text, it comes back pretty quickly with a pretty decent answer. You can ask it to keep that limited to a certain number of words. And then those words can get turned back into voice again and spoken back to you. And that's what's known as a cascading pipeline. That's one sort of field of this. The other area is what gets called a real-time pipeline — they're both actually real time — and that's the voice-to-voice pipeline. OpenAI coined the phrase real time, even though the cascading pipeline I just described is actually just as fast. So it's also real time, but real time as coined by OpenAI — and Google Gemini does the same — actually takes voice directly into the LLM and sends voice back out again. The idea being that it's even faster. But even though the speeds are similar, it does have other advantages, in as much as the LLM can actually hear the emotion in your voice. It can comment on your pronunciation of certain words in a language tuition setup. And it's able to format the tone of its output voice to actually match the sentiment of what it's saying a little bit more strongly. So though it's a bit of a black box system, it has its advantages in taking the audio in first and dealing directly with it, versus the cascading that's more... In theory, but even with cascading, when the speech to text happens, some speech to text engines will put in metadata to describe how the person spoke it. Like, was it sad? Was it energetic? Those types of things. And also when you send the text from the LLM output to the text to speech engine, you can include markup — speech markup — to tell it which bits should be pronounced in certain ways, and again control the emotion in the voice that comes out at the end. So actually they're pretty similar, even though no one really knows exactly unless you work at OpenAI or Google Gemini. I think the implementation behind the scenes of their real-time voice-to-voice models is kind of a cascaded model anyway. Although interestingly, when you look at the text coming out of OpenAI along with the voice, they often mismatch. So I might hear, "Ben, how are you?" but in the text it's saying "Hi Dan, how are you?" So they're not perfectly in sync. It's like they're using two different systems to create the effect of a pure multimodal LLM. And by multimodal, we mean capable of speaking text, voice and video all at the same time.
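To make the cascading pipeline concrete, here is a minimal sketch of a single conversational turn. All three provider functions are placeholders rather than any real vendor SDK — in practice each would wrap your chosen STT, LLM, and TTS service, and the synthesize step is where SSML-style speech markup would be attached.

```python
# Minimal sketch of one turn of a cascaded pipeline: STT -> LLM -> TTS.
# All three functions are stand-ins, not real vendor APIs.

def transcribe(audio: bytes) -> str:
    # Placeholder STT call; a real implementation would stream `audio`
    # to a speech-to-text provider and return the final transcript.
    return "what's the weather like today"

def complete(prompt: str) -> str:
    # Placeholder LLM call; the prompt caps the reply length, as mentioned above.
    return "Sunny and mild, around 22 degrees."

def synthesize(text: str) -> bytes:
    # Placeholder TTS call; this is where SSML-style markup could be added
    # to control pronunciation and emotion of the spoken reply.
    return text.encode("utf-8")  # pretend these bytes are audio

def one_turn(user_audio: bytes) -> bytes:
    transcript = transcribe(user_audio)                          # voice -> text
    reply = complete(f"Answer in under 40 words: {transcript}")  # text -> text
    return synthesize(reply)                                     # text -> voice

print(one_turn(b"...caller audio..."))
```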
the text is more of like an approximation as to what it said versus like a word-for-word. It does speech to text on what goes in and out of the LLM separately to what the voice LLM is actually doing. Very interesting. And they're not perfectly in sync, which is a bit misleading at times. At least with a cascading pipeline, you know that what you see in the text is a hundred percent going to match what was said — what it thinks you said and what it's trying to say back. Yeah, I think that transparency is pretty key though, right? Because the black box method has some of those limitations. You don't get the same access to that information for auditing purposes and setup in that sense. But you also have kind of like limited control, right? You have to use whatever function tools they have put in place there and that sort of thing, versus cascading. With cascading, you can mix and match your providers. My hottest pipeline right now for the language learning use case is to use ElevenLabs for speech to text — even though they're known for text to speech, their new speech to text is really good for multilingual. Then into Groq for your LLM, probably a Llama 70-billion model, which is super fast — it's on specialist hardware, so it replies really quickly. And then into OpenAI's text to speech, which is able to mix languages and has really good emotion built in, different accents, dialects; it's really quite good at telling jokes, it can whisper, it can do this type of thing. That's a pretty interesting pipeline, right? Like completely from left field in terms of what each of those providers is really known for, right? You would think OpenAI would be maybe more the LLM of choice. Exactly. So it's back to front, isn't it? Yeah, that's super cool though. And I'm going to have to try that out for my next project. So what motivated Deepgram to build a speech-to-text engine kind of from the ground up, using that end-to-end deep learning? Guest (05:21) That's a great question. It takes us back to the early days of the company, when it was founded and they chose to focus on that early on. That is not actually the first problem that they tackled. So the founders were originally dark matter physicists and they were doing a lot with audio. They were using audio signals, like shooting them into the earth and then measuring what came back, and then trying to use machine learning to understand whether or not there was dark matter present. So they had some very strong expertise that they were building, applying deep learning in their particular application. This is back in like the 2015 timeframe. They were also doing weird stuff like recording themselves for weeks — all of their audio, wherever they were; they attached mics to their clothing and they were recording themselves. And so they amassed this very large volume of audio from recording their everyday lives. And they were looking at the machine learning stuff that they were doing, and looking at this large corpus that they would never be able to listen to, to find the interesting tidbits. And they decided to try and tackle the problem of using machine learning to search the audio, just as a side project. And that was how Deepgram started. They went and founded a company. They built a deep search algorithm, and in doing so they actually indexed all of YouTube at the time. And you could find random audio clips in a YouTube-scale corpus.
They demoed on stage at GTC with Jensen very early days, but they realized at that time that search was not a hot thing and there wasn't a big market for it. So speech recognition at that time was an emerging green field — there were very few players and all the models were terrible. And so they had this strong conviction from the beginning that if you combined a system that's learned end to end, a single network, and lots and lots of high quality data, you could build a model that could transcribe potentially any human in any situation. It was just one of the early convictions that they had. They went about building an early prototype, circa the 2016 timeframe. Hermes Frangoudis (07:17) Super interesting. Gotta love that gap in the market, right? Like everything else is terrible and you're like, actually, I think if we do this, it could solve this, right? Guest (07:26) Yep. They also kind of got lucky in that there were some early AI adopters from call centers — basically AI platforms that had many call centers as customers. There's huge volumes of call center audio that they wanted to transcribe and do analytics on. And that particular domain is narrow enough that the very early deep learning models that we had actually worked. And you could train on these narrow domains and produce models that were like 80, 90% accurate if you just specialized the data. And that was some of the early magic of Deepgram models: they worked for particular applications where there was an interest and a lot of data and people who were willing to try to use the models. So that was one of the reasons that we built for scale early on too. Hermes Frangoudis (08:10) Super interesting. It was kind of like the approach led them down the path, right? Like, hey, we're here, we have this stuff, we can apply it here. And now, because call centers, like you said, also have that very rigid, structured type data, they could apply it there and kind of scale. That's super cool. What are the most common misconceptions people might have about speech recognition? Guest (08:35) I would say the biggest one is that it is a solved problem. I think even Jensen has said that in his most recent GTC keynote — he said speech recognition is a solved problem. And it is definitely not a solved problem. It's only solved in some very narrow situations. So it works well in situations where we have a ton of data. There are particular use cases — like call center audio, for example, in English — where we have a lot of data. We can train models at scale, and this paradigm of a large, expressive deep learning model trained on lots of data works well. And we've over time collected enough data across a lot of niche domains, or what would have been considered more niche domains a few years ago, in English. And so the models have gotten really strong in English. But I think that in non-English languages the models are in general still pretty terrible across the board. And it's just because of the lack of data. And then I would say, Hermes Frangoudis (09:25) Right. Guest (09:33) beyond just having data to cover the very broad range of speakers and acoustic conditions that you're trying to model, the other big challenge that is not solved is being able to recognize rare and localized words. Like words that are specific, say, to a particular customer or a particular person — like how their name is spelled, for example. Words will continue to be a challenge moving forward. Getting the words right is one of the core challenges that's sort of underappreciated, I would say, in speech to text.
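An aside for readers: the "80, 90% accurate" figures above are usually stated in terms of word error rate (WER), the standard ASR metric — word-level edit distance between the reference transcript and the model's hypothesis, divided by the reference length. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# "90% accurate" in the loose sense corresponds to roughly 10% WER.
print(word_error_rate("please refill my prescription today",
                      "please fill my prescription today"))  # 0.2: 1 error in 5 words
```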
Hermes Frangoudis (09:59) I feel like it leaves a lot of room to grow, right? Like it's still very early days on this sort of thing. That challenge is kind of what you guys have solved on the English side of it. And I'm sure you're going after international and colloquial, localized dialects and stuff like that. In terms of how you take the approach for speech to text, really how does Deepgram balance this real-time latency with the accuracy of the transcription? Guest (10:33) That's a great question. There's two major use cases for speech to text. And in your question, you're honing in on real time — where you want the model to be actually transcribing the words as soon as possible after they're uttered by the person. And that is just a very hard challenge. The other major use case is batch, or asynchronous, transcription. And that's where you have your 10,000 calls for the day that were recorded at your call center, and you want to be able to transcribe all of them in like 30 minutes or an hour — as quickly as possible — so that you can get the insights about what happened today. And so I would say, you know, for both of those kinds of problems, Deepgram has always worked in a constrained design space where we are trying to achieve the maximum accuracy subject to: it must be fast and scalable. They're very concrete performance and engineering requirements. And when you operate in this constrained design space, there are many approaches that do not work well that otherwise look great in papers — you know, models that don't scale. Also, very, very large models become impractical if you're trying to hit latency targets. And so it's pretty simple: small models, constrained design space, and then actually imposing those requirements from the beginning so that you've designed with scale and speed in mind. Hermes Frangoudis (12:02) So really having that constraint from the start — not taking things that, yeah, they work in batch, because in batch it doesn't matter. Guest (12:11) I would say — so I'll go one level deeper there, double-click on it. The data actually impacts being able to solve that problem. So the more high quality data that you have, and the more localized it is for whatever particular domain you're trying to model, the smaller you can make the model for a given level of accuracy. And so there's a joint dependence there. If you want to think about it: the better the data you have, the more efficient a model you can use. Hermes Frangoudis (12:15) If you will. Guest (12:40) And so we can leverage the data advantages that we have. We have collected and labeled a lot of high quality data to actually make the model smaller. Hermes Frangoudis (12:48) And like you said, the better the quality of the data, the smaller the model, the more accurate it can be. So can you walk us through a little bit of that model training process — the data sources, pre-processing, that sort of architecture? Guest (13:02) Yeah. I'll start with the big picture: how do you train a state-of-the-art speech recognition model today?
I think the simplest way to put it is that it's a two-stage process. And this is typical across many deep learning models: you have a pre-training stage and a post-training stage. So there are parallels to this in LLM training — that's how the LLMs are produced. Kind of the same thing in speech recognition. So in the first stage, you're trying to train with a very large scale of data: as much data as you can get your hands on, as many voices as you can get, as many acoustic conditions as you can get, and then as many examples of words being spoken. So this is something that is maybe underappreciated — you want the model to have a very broad exposure to speakers and audio conditions. But when you do that, you also increase the frequency of particular words relative to others. Hermes Frangoudis (13:33) Covering everything. Guest (13:59) So there's this interesting scale effect: when you scale up the data, the frequency of stop words — the most frequent ones, the ands and thes — explodes. And then you have an emergence of a very long tail of rare words that appear. So basically the best you can do is just keep scaling the data as much as possible in the first stage, and you train a model on that. In that case, you're training primarily on data that's crawled from the web, and then you're filtering that data to the best of your ability to isolate audio that has human transcripts where the human transcripts are good, basically. So that's the name of the game for the first stage. And you get a model that is pretty good, I would say. And that is how, for example, Whisper was produced. Whisper is like the first stage of a production-grade speech-to-text training. In the second stage, you specialize the model in post-training, and you train it on a much smaller, more narrowly distributed corpus that covers just the domains that you largely care about. Yeah, you focus the training. And in that case, the data is very high quality. It must be basically gold ground truth, labeled by humans that are following a very prescriptive style guide so that the labels are consistent. In the first stage, you have labels that have been generated by millions of different humans with no consistency in style. And so then you have to unlearn that in the second stage, and the model's output becomes... Hermes Frangoudis (15:30) everything I taught you. Guest (15:33) Yeah. Yeah. And so that's basically how it goes. The magic really happens in the second stage, although the first stage is also important. So one thing I'll say — something we've observed that we haven't, let's say, published — is that as you scale the corpus in the first stage, there will be a set of words that you've seen, let's say, 10,000 times or a hundred thousand times. And you sort of saturate the model's ability to predict those words — if you were to show the model more examples of those words, it wouldn't help. And then you have this long tail of words that you've seen less than that threshold, let's say 10,000 to 100,000 times. And word error rate depends just directly on how many times you have seen those words. Yep.
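The saturation effect the guest describes is easy to picture with a toy corpus audit — a minimal sketch, assuming the guest's illustrative threshold of 10,000 occurrences: words above it are "saturated" (more examples stop helping), and for everything below it, recognition accuracy tracks the raw count.

```python
from collections import Counter

def split_vocabulary(corpus_tokens, saturation_threshold=10_000):
    """Partition a training corpus's vocabulary into words the model has
    effectively saturated on, and the long tail whose word error rate
    tracks how many times each word was seen."""
    counts = Counter(corpus_tokens)
    saturated = {w: c for w, c in counts.items() if c >= saturation_threshold}
    long_tail = {w: c for w, c in counts.items() if c < saturation_threshold}
    return saturated, long_tail

# Toy usage: in a real audit, corpus_tokens would be billions of transcript words.
saturated, tail = split_vocabulary(
    ["the"] * 20_000 + ["refill"] * 40 + ["cystic"] * 3)
print(len(saturated), "saturated words;", len(tail), "long-tail words")
```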
Hermes Frangoudis (16:26) It's pretty crazy. So Whisper — essentially the thing that everyone claims to be building off of — is just part one of the puzzle. So if you're not applying part two, you're not going to get the performance that you should be getting. Guest (16:41) Yep, that's right. And you get all kinds of what we call model pathologies that result from part one. The model will insert words that aren't there. The model will omit words that are actually present in the audio — what we call shyness. Hermes Frangoudis (16:54) I've seen this, and I've had people tell me, no, that doesn't happen. It never happens. And it's like, it totally happens. And you get one thing in the audio and then one thing in the transcript and you're like, they don't match. Guest (17:00) It totally happens. Yeah. And customers react to those two failure modes differently. When the model's inserting words that are not there, it can be very creative, and so you sort of never know what it's going to say — so those failure modes read differently, you know, depending on what the model's being used for. But the silence is just universally despised by all customers. The model should produce something when words are happening. And that's a big one. Yeah. Hermes Frangoudis (17:18) Oh boy. Super interesting to hear. I feel like sometimes missing a word is also a little bit more forgiving, because you're like, all right, it just kind of missed the word, it didn't add to it. Like, it's a lot more jarring when there's words added to it. Guest (17:45) Yeah, that is more jarring, especially if you're seeing it in real time. It is kind of interesting that real-time speech recognition is just more challenging than batch, because you're operating on small chunks of audio at a time — in principle, when you really shrink down the buffer of audio that you're sending. And so that's just harder. You have less context. You don't get to see what is coming in the future; in batch you could see the whole file at once, right? Hermes Frangoudis (18:10) You're constantly building on it — like, as it goes, you're building on the previous buffer that you got. Guest (18:16) Yeah. And there's a tremendous range of different things you can do there from a modeling perspective, but largely speaking, you're maintaining some state about what you've seen so far and updating it as new audio comes in. And then the model may be doing something like deciding whether or not it's going to emit a prediction at this frame or not. And so real time opens up all kinds of interesting mechanics, but model pathologies are also more prevalent as well. And then also, real time is the setting where people are watching the transcription live, and they see the errors, right? So yeah, for all these reasons, real time is way harder. Hermes Frangoudis (18:52) You don't have to tell us. We're well aware. And it's not just talking to an LLM — talking to a person in real-time voice and video streaming, that's another one of those things where someone's talking and you expect it to just happen. How do you manage that state in real time for the VAD to be so precise in that? Yeah. VAD essentially acts as the traffic controller here, by detecting both the start of sentence (SOS) and end of sentence (EOS). As I mentioned before, it is what triggers the entire chain in real time. So for example, once VAD picks up the end of sentence — like a 200-millisecond silence, and this time window we can adjust based on our needs — it signals the STT system, the speech to text system, to finalize the transcription. And this transcription will then be sent to the LLM, and the LLM output then triggers TTS to synthesize a response.
So this is why low VAD latency is very important in the conversational AI application. Yeah — like, you said it best. I've never heard it said that way, but I really like how you put it. It's the traffic controller. It's like, let's go, let's go. You're coming in. You're coming in. We're done. We're done. That's it. Time to process, right? End of sentence and end of speech. Let's get it moving. But another critical point of that traffic controller is also to be like, oh wait, stop this whole process — you know, we have an interruption. So VAD has to, you know — the start of sentence, right? Yeah. The start of sentence controls, or affects, the interaction latency, I think. And the EOS affects the whole end-to-end, or response, latency. So it's very important. Yeah. It's such a critical piece. So then when you see VAD being run — is it more edge or full cloud? Like, how important is it where you run it, given it's dependent on the use case? The VAD model is essentially very small — only several hundred kilobytes. Typically it can be run on edge devices to ensure low-latency detection of voice activity. And users hate delays when interacting with agents. Running VAD on the edge just cuts down the lag, because it is right there on the device, detecting speech starts and stops instantly, immediately. The other thing is about the bandwidth and the cost savings. Yeah. If the VAD is on the edge, it only sends audio frames that actually have speech to the STT — speech to text, or the ASR, system — whether that's on the edge too or in the cloud. And it can help the users to save the bandwidth costs, the STT costs, and keep things efficient. Yeah, that's huge. The STT cost is something you don't want to mess around with, right? Because you don't want to just send the noise or the non-speech segments to the STT systems — it is not meaningful. Exactly. Yeah. Nothing would come out of it. You're just burning tokens.
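Both roles of the "traffic controller" — dropping non-speech frames before they hit the STT bill, and firing end-of-sentence after a short silence window — fit in a few lines. A minimal sketch, assuming 20 ms frames and the 200 ms EOS window mentioned above; the per-frame speech decision would come from an actual VAD model running on the device.

```python
FRAME_MS = 20           # assumed frame size; real VAD models often use 10-30 ms
EOS_SILENCE_MS = 200    # the adjustable end-of-sentence window from the discussion

class TrafficController:
    """Toy VAD gate: forwards only speech frames to STT and signals
    start-of-sentence (SOS) and end-of-sentence (EOS) events."""

    def __init__(self):
        self.in_speech = False
        self.silence_ms = 0

    def process(self, frame: bytes, is_speech: bool):
        """`is_speech` is the per-frame verdict from the on-device VAD model.
        Returns (frame_to_forward or None, event or None)."""
        if is_speech:
            event = "SOS" if not self.in_speech else None  # speech just started
            self.in_speech, self.silence_ms = True, 0
            return frame, event          # forward speech frames to STT
        if self.in_speech:
            self.silence_ms += FRAME_MS
            if self.silence_ms >= EOS_SILENCE_MS:
                self.in_speech = False
                return None, "EOS"       # tell STT to finalize; LLM turn begins
        return None, None                # silence: drop frame, save bandwidth

# Usage: feed frames as they arrive; on "EOS", finalize the transcript.
vad = TrafficController()
for frame, speech in [(b"a", True), (b"b", True)] + [(b"-", False)] * 12:
    forwarded, event = vad.process(frame, speech)
    if event:
        print(event)   # prints SOS, then EOS after 200 ms of silence
```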
You mentioned relatability — the ability to pronounce certain words. And I noticed different voice models will pronounce or mispronounce the same word. Even though it's a word I would think they should all pronounce the same, it's funny to watch some stutter on it or completely mispronounce it, versus others that hit it dead on. Guest (22:56) I think, you know, if you throw a list of extremely challenging brand names at any text to speech model, you're going to find that some of them it pronounces right and some of them it pronounces wrong, and that set is different for each model. At the same time, we're often selling to teams of enterprise developers. We sell to enterprise teams — teams of developers who are building really high-volume calling applications for, in many cases, Fortune 100 businesses. And you can imagine, right? It's one thing to mispronounce a word. It's another thing to not have predictability over whether a word is going to be pronounced correctly or not. And so a lot of people build what I would describe as general-purpose speech models, where they're really good at reading things out. But what you lose from having less richly annotated data on the phonetic level — I'm talking about having a phonetic transcription system for the data, and that's what you train the text to speech model on, not just how the words are spelled, how you and I would spell them — what you lose is predictability and control. And what I mean by that, right, is: if you send Häagen-Dazs to our API, maybe — I'm not sure — maybe we pronounce it correctly or incorrectly today. But we want to build models where we can tell you, A, that we are less confident that we're going to pronounce it correctly, and B, if we're not pronouncing it correctly, you can fix it immediately without us having to retrain the model. Hermes Frangoudis (24:19) So being able to pass that pronunciation annotation kind of with it, if it doesn't feel confident enough. Guest (24:27) And by the way, yes, exactly that. And at the same time, this is a feature that's existed in text to speech models since the very beginning of text to speech models. And at the same time, no one else has built workflows around telling you, right: here are the words we're not pronouncing correctly today. Otherwise you just have to call yourself and guess. And by the way, this wasn't a problem pre-LLM, because if you're talking about this IVR maze, right, everything's basically pre-generated. You can run QA, and once you run QA, that's it — everything sounds good. If we're not pronouncing Häagen-Dazs correctly in the phone tree, you pass in the International Phonetic Alphabet, you use the text to speech model to create audio, done. But now, in the era of LLMs, you don't even know what the voice agent is saying before it says it. Hermes Frangoudis (25:09) It's completely unpredictable. So the ability to pre-QA it has gone out the door, right? So it's more about: how do you pronounce it predictably? How do you ensure that there is confidence or no confidence? Guest (25:22) And really, the Rime thesis is: if you don't have a path to a hundred percent accuracy, then enterprise won't adopt. I'm not saying you have to have a hundred percent accuracy, because we're never going to have a hundred percent accuracy, but if you don't show teams and developers who are building voice agents a path to a hundred percent accuracy, they can't build a product, really. Yeah. They can't build. Right. And so if you're, you know, Providence Medical and you're building a voice agent for doing a genetic counseling screening — you have no idea what the patient on the other end of the call is going to say, right? Maybe they have a family history of cystic fibrosis. You never thought about cystic fibrosis before in your life. And then the voice agent mispronounces cystic fibrosis, and you're like, this is the opposite of an empathetic clinical experience. Do you know what I mean? It goes down the drain. Hermes Frangoudis (26:09) Yeah, it goes down the drain real quick.
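The "fix it immediately without retraining" workflow could look something like the sketch below — a hypothetical request shape, not Rime's actual API: a per-customer lexicon of IPA overrides, plus surfacing the words the model is least confident about. The IPA transcriptions shown are illustrative.

```python
# Hypothetical pronunciation-override workflow (illustrative, not a real API).

PRONUNCIATION_LEXICON = {
    # written form -> IPA override; entries are fixable without retraining
    "Häagen-Dazs": "ˈhɑːɡən ˌdæs",            # illustrative transcription
    "cystic fibrosis": "ˈsɪstɪk faɪˈbroʊsɪs",
}

def prepare_tts_request(text: str, word_confidence: dict[str, float],
                        floor: float = 0.9) -> dict:
    """Attach IPA overrides for known-hard terms and flag remaining
    low-confidence words so a developer can add lexicon entries up front."""
    overrides = {term: ipa for term, ipa in PRONUNCIATION_LEXICON.items()
                 if term in text}
    needs_review = [w for w, conf in word_confidence.items()
                    if conf < floor and w not in PRONUNCIATION_LEXICON]
    return {"text": text, "pronunciations": overrides,
            "needs_review": needs_review}

# word_confidence would come from a (hypothetical) per-word confidence report.
req = prepare_tts_request("We stock Häagen-Dazs now",
                          {"Häagen-Dazs": 0.4, "stock": 0.99})
print(req["pronunciations"], req["needs_review"])  # override applied, nothing flagged
```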
That's super interesting. So in terms of these nuances in the linguistics — accents, intonations, the emotional variance, all of this — is that captured in the model differently than maybe a regular TTS/ASR-type system, or do you kind of follow standard methods, and it's just more about the richness of the data and how good you are with it? Guest (26:34) It's not rocket science. I would describe it as not rocket science, right? You know what I mean? Garbage in, garbage out. No, no, no, no, but it's worth restating. What makes us different is not necessarily that we're rocket scientists; we have linguistic expertise, and we value that linguistic expertise, and so, as a result, we foreground it in the data that we collect and annotate, but — Hermes Frangoudis (26:41) No, it's common sense, I hear you. Guest (27:00) At the end of the day, that's what we're focused on. And if you're building a general-purpose voice model, you don't need — I mean, you might not care. I mean, that's okay. And that's what I think is most exciting about modeling, right? Like, okay, take Sora 2, right? Did they build Sora 2 with the intention that if you, Hermes, are adding your likeness to Sora 2, it's going to pronounce your last name correctly? No, I don't think so. Hermes Frangoudis (27:24) And there's so many cases. Guest (27:27) But I don't think that's what's important about Sora 2. That said, it would be great if it could pronounce your last name correctly. Hermes Frangoudis (27:35) I'm happy that it got my first name, but you know. Guest (27:37) Anyway, you get what I'm saying, right? That's what's exciting. There's different models for different use cases. Hermes Frangoudis (27:42) And I've definitely seen it: depending on what kind of agent you're trying to build or what kind of task you're trying to accomplish, the voice that you're using, the model that you're using, all kind of plays into that pipeline. Right. So it's like selecting the right tool for the job. Guest (27:57) And customer experience — I would say we're training voice models for customer experience, really. And again, this is not a term I knew three years ago, customer experience. But when you look up the definition, it says something like: it's the sum total of impressions that a consumer has with your brand. Right. And so, different tools for different jobs. Hermes Frangoudis (28:15) Yeah. And you want them to have a good experience no matter what kind of job they're trying to do. And it all comes down to meeting the customer, meeting their expectations, which is really cool to hear. So what are some of the trade-offs maybe between more expressive tones or speech patterns versus recognizability when you're designing some of this stuff? Is it purely just like, hey, we're going to have these different flavors and we're just going to Guest (28:20) Exactly. Exactly. Hermes Frangoudis (28:44) give them different annotations of data, like the richness, and it all kind of comes out the way it does? Or is there something you kind of do in that sauce? Guest (28:51) There's still a lot of missing pieces, I would say, in having voice models that are both highly expressive and also controllable. There's still such a trade-off there today. And it doesn't have to be that way. I would just say, again, people are focusing on different things at different points in history. And we're at a moment right now where people have sacrificed controllability for expressiveness. And again, it's not going to remain that way for very long; Rime will train... You know, the most expressive speech synthesis models today are trained on top of large language models that have only seen text, really. It's strange, but true. You can take a large language model that saw 25 trillion tokens of text — just text, right? — and then you start showing it text and audio, and it learns text and audio. It's crazy, but true.
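One way to picture "showing a text LLM audio" — a toy sketch of the common recipe (not necessarily Rime's), where audio is first quantized into discrete codec tokens that get their own ID range appended to the text vocabulary, so the decoder-only model keeps doing ordinary next-token prediction:

```python
# Toy illustration of turning a text-only LLM multimodal: audio is quantized
# into discrete codec tokens, which get their own vocabulary rows appended
# after the text vocabulary, and training sequences interleave the two.
TEXT_VOCAB_SIZE = 50_000      # pretend tokenizer size (assumption)
AUDIO_CODEBOOK_SIZE = 1_024   # pretend neural-codec codebook size (assumption)
AUDIO_OFFSET = TEXT_VOCAB_SIZE
SEP = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE  # special token between modalities

def build_training_sequence(text_ids: list[int], codec_ids: list[int]) -> list[int]:
    """[text tokens] <SEP> [audio tokens]: the decoder-only LLM just keeps
    predicting the next token, the same objective it learned on text alone."""
    audio_ids = [AUDIO_OFFSET + c for c in codec_ids]  # shift into audio rows
    return text_ids + [SEP] + audio_ids

seq = build_training_sequence(text_ids=[17, 942, 8], codec_ids=[3, 77, 512, 9])
print(seq)  # [17, 942, 8, 51024, 50003, 50077, 50512, 50009]
```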
Hermes Frangoudis (29:40) So it becomes kind of multimodal. Guest (29:42) It is multimodal, yes, exactly. Which, by the way — no one would ever have described a text to speech model before as multimodal. You predict audio given text; in that way, text to speech is the OG multimodal. It's never not been multimodal, is what I'm saying. But to your point, yes, definitionally, in the way that people talk about large language models, it becomes multimodal. Yes. Yes. Okay. The benefit of that is it's seen 25 trillion tokens of language, right? So it basically understands something about language. Hermes Frangoudis (30:10) It's able to create those little nuanced patterns in the written language, which can then somehow help reinforce it in the audio. Guest (30:19) Crazy but true. And at the same time, those 25 trillion tokens of text data are not phonemic, phonetic representations, right? They're just how you and I would spell words. Yeah. And so there's still work that needs to be done to post-train the text-based LLM, still on text, but phonemic text. Yeah. And so what you're seeing now — essentially this is frontier research — is how to take this LLM backbone and post-train it to learn phonetic representations in concert with the orthographic, written forms of text language, and then post-train it again to become multimodal. And so that's really the cutting edge, and that's how you're getting a high level of nuance and richness in the spoken language while also getting the controllability that comes with the phonetic representation. So that's the future. Hermes Frangoudis (31:08) That's amazing. And that's like the bleeding edge right now. That's the frontier that is really pushing the space forward. Guest (31:14) Yeah, it's like Google and Rime, basically, that are doing that. Hermes Frangoudis (31:17) What is Rime's approach to modeling for underrepresented dialects or speech patterns — things that are not that common, but they exist enough that when you're an enterprise, you need to account for these edge cases? Guest (31:32) You would be shocked at how — we describe it as a long tail, right? But you would be shocked at how tail-like it is, such that you have one of the most prominent agent builders for customer experience in 2025 saying they can't get high quality Castilian Spanish voices — like Spain Spanish voices. Literally almost impossible. There's Latin American Spanish, right? But why is that? And they have no idea why. They tried this model, they tried that model, they put it in front of, you know, large enterprise in Spain, and people are like, this sounds like someone from Colombia. You know what I mean? And it's such a moving target too, because languages change all the time, and the colloquial variety of a language changes all the time. And as these AI agents become more capable, our expectation that they remain colloquial — right? — and sound like people talk today will be ever increasing. And so you have to collect that data, and you have to train models that are purpose-built for fidelity to how people expect that people talk.
And at the same time — by the way, there exist Castilian Spanish models from four or five years ago; people just don't like them anymore, right? Taste is ever changing. And with that said, there is essentially no high quality Hindi text to speech model in existence today. There is essentially no high quality Arabic text to speech model in production. Why is there no high quality Arabic text to speech model? Because people built Arabic text to speech models for what's called Modern Standard Arabic, which is a form of Arabic that no one speaks natively. It's what every audiobook would be narrated in. It would be as if we were reading Shakespearean English all the time, and that was the only thing that existed in datasets of English. And we're talking right now, and we're obviously not talking like Shakespeare. I mean, Hermes Frangoudis (33:28) And even talking about traditional English — we're talking an American dialect of English. Guest (33:33) Right. No. Yeah. So the first-order problem is just having, you know, Saudi Arabic — like generic Saudi Arabic. And then moving down — because we've proven that if you talk to a fast-casual restaurant voice agent in Atlanta and you hear an African-American English voice, the likelihood of you completing that order increases. The relatability. Exactly. Hermes Frangoudis (33:40) And then moving down into the... just because of the relatability. Guest (33:59) And so, is there anything like that happening in Saudi Arabia? No, not at all. Because they don't even have Saudi Arabic models to begin with. Hermes Frangoudis (34:06) I wonder if they're doing it in Greece and Cyprus, where my family's from. Those are very different dialects of Greek — it's funny, you see videos on Instagram of how it's pronounced in one versus the other and it's just like, man, you don't even realize. Guest (34:20) I know. Yeah. People don't realize that someone in Saudi Arabia couldn't even understand someone in Morocco. It's like Spanish and Italian. Hermes Frangoudis (34:24) Yeah, because they're just completely different dialects. I also want to hear about what you're building with voice and video, because I know you're building some really cool stuff on bringing this all together. You're using Agora, but can you maybe walk us through some of the awesome pieces of software and systems you're putting together? Yeah. So I like Agora for its enterprise-grade ability to stream billions of minutes, right? It's an enterprise-grade infrastructure that has been tried and tested. And then, when it's related to conversational AI, there are many tools out there. Some of the notable ones: ElevenLabs is a pioneer, and OpenAI obviously, and Microsoft Azure voice libraries are there. And there are many other conversational AI platforms that are coming up, which is super exciting in this space. However, what I like about Agora is it's a tried and tested, enterprise-grade framework on top of it. And it is being built in a way that we can integrate with the various systems. I mean, if you don't mind, I can quickly share something. Yeah. It's built like Lego blocks. Gives you all the control while taking away all the headaches, right? Yes. Do you see my screen? Okay. All right. So what you're seeing here is the traditional Agora framework, right? So you have the Agora platform.
It has SIP and WebRTC conversations, and it has the ability to integrate with enterprise systems — CRM systems, call records, everything. And it can stream; it can do all kinds of really cool stuff related to text, voice and video streaming, right? On top of it, the conversational AI is being built, and this conversational AI is, like you said, a Lego block which allows it to integrate with any other system. So for example, this is a stack that I'm just showing. At Colaberry, we have an AI orchestration platform where you can fine-tune LLMs, fine-tune text to speech, translation services, MCP connectors, and all kinds of orchestrations. So these architectures can be used to enable, basically, Convo AI. So let me just show you this particular flow diagram a little bit. Here, as you can see, this is an app where real-time audio streaming can happen. And in the backend, you can integrate with agentic flows. So for example, here you can integrate with ElevenLabs text to speech, right? You can integrate with OpenAI Whisper modules, or you can integrate with Microsoft modules, right? And in the backend, for an enterprise — which has a lot of custom intelligence inside the company — you may need large LLM models, and you have custom databases; you can build RAG tools and everything, right? So the Agora platform, by being the front end and having the ability to plug in any of these backend systems, allows you to build an enterprise-grade text to speech, speech to text, conversational AI platform. Any questions here? So here — I'll make it a little bigger — the person calls in. So this isn't like your traditional app; this is a dial-in number. So they're interacting with a traditional interface in terms of the mode of connection, right? Yeah. So the beauty of the Agora platform is it can work with both phone as well as VoIP — a regular telephone and VoIP. So you have a SIP switch, right? And with the SIP switch, you can basically stream the voice across the regular telephone line. But in this particular one, you could just have a client — an Android client or an iOS client. So it allows both of those, right? And this whole Agora infrastructure is already there, and it works at enterprise grade. So all we are doing is making it AI-enabled, because previously these are not necessarily smart conversations; by integrating with AI platforms, they become smarter conversations. Interesting. You're using Agora as kind of the underlying layer of the streaming orchestration, and then tuning it into the pieces that you've been building over the years, that your enterprise customers have been building over these years — the knowledge database, the knowledge base, their MCP servers, all these things that can exist without voice, but now being able to bring them to this new interface, sort of. That's correct. So all these interfaces already exist in the internet 1.0 and 2.0 world. Now, how do you upgrade to the AI world where it's smarter? Right? So that's where the enterprises may already be using the Agora platform for their call centers and all kinds of streaming needs that they have, right? And the calls may already be getting recorded, et cetera. Now we can integrate them with the AI infrastructure and make the whole business smarter, and maybe they can even innovate new products and services on top of it.
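The "Lego blocks" idea — the transport layer stays fixed while STT, LLM, and TTS providers snap in and out — is essentially programming against interfaces. A minimal sketch with hypothetical names, no real vendor SDKs:

```python
from typing import Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LanguageModel(Protocol):
    def reply(self, transcript: str) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoiceAgent:
    """The transport layer (SIP/VoIP or WebRTC) hands this agent audio;
    any provider object satisfying the protocols above can be snapped in."""
    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech):
        self.stt, self.llm, self.tts = stt, llm, tts

    def handle_turn(self, caller_audio: bytes) -> bytes:
        # One cascaded turn, independent of which vendors were plugged in.
        return self.tts.synthesize(self.llm.reply(self.stt.transcribe(caller_audio)))
```

Swapping one STT or LLM backend for another then means implementing one small adapter class rather than rewiring the pipeline.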
So for example, if you take the healthcare industry, confirming appointments is a big thing. Following up with people is a big thing — like prescription medications. Usually, instead of humans following up, or even robotic calls being made, these can be a lot more humanistic, empathetic calls, because the AI interacts; it's not just delivering information. Yeah, it has the ability to interact, so you can upgrade and make the processes a lot more human. You're getting away from that traditional phone tree: hi, we're calling to remind you about your — insert whatever — press one. Right. So this is having a natural conversation, which for most people is all they really want to be able to do when they call in. I don't think they really care whether it's a person or not. It's: I don't want to have to change the way I interact and speak to accommodate their phone tree or their technology that just makes my life more difficult, right? So this is an ability to really bring that human interaction back into things. Voice, it turns out — across those different modalities that I spoke about, content, acoustics and so on — is extremely powerful. I mean, I knew this from my research, but I don't think we ever anticipated just how powerful it could become with the power of AI and the right data set and so on. So voice is super powerful, very strong. You can see multiple facets of somebody's state of body, state of mind. But the other big asset of voice is it is everywhere. It is ubiquitous. Everybody in the world — or rather, not everybody, but the majority of people in the world — will use voice to communicate. So it is very natural, it's very comfortable. It's an obvious way of actually communicating. And that also means it's very easy to access. People naturally speak. They don't necessarily naturally turn a video on or anything, but talking on the phone, et cetera — people are very comfortable with it. And it's not as privacy-invading as video. So it's that combination of ubiquity, ease of access, ease of collection, and the strength within the signal itself that meant we moved from voice, video and behaviour to just voice. The one problem we have had with voice — and everybody has it in voice AI — is that voice is so strong. It carries so much signal, not just about health, but about everything around who you are as a human: your age, your birth sex, your accent, your background, ethnicity, how many languages you spoke until the age of nine that affect how you sound now, where you live. So that is a very strong signal. And you need to differentiate that from what is the actual signal for, say, depression, fatigue, stress, diabetes — differentiate out and isolate the signature that is the condition. And that's kind of what we specialize in doing. Interesting. And does it come multilingual? Yes. Because we started multimodal and we have so many ways of covering the gaps, our models do work across different languages. We typically need to fine-tune a little bit for a new language, but we've done it now across Greek, English, Spanish, Italian, Brazilian Portuguese, Portuguese, Indonesian, and Japanese as well. So it's expanding now to all different languages.
Those are very — like, when you think about the style of language of each of those — very diverse. They are. From intonation to tone to just style of speaking, right? Like some people speak really fast, just naturally. Yeah. Slow, loud. Some people just always sound angry, when that's just culturally how everyone sounds. My co-founder Stefano always sounds angry. He's Italian. What problem do you think really stuck out to you in terms of speech recognition or translation that you didn't feel was really being solved at the time? What gave Soniox that early opportunity to come together? Yeah. I mean, the key thing about speech to text, speech recognition, is accuracy. And it sounds simple, but it's actually quite complicated. And when you say accuracy, I don't mean accuracy just for English, okay, and just for clean speech environments that have really no noise and super high quality audio. When I say accuracy, I mean speech AI that's built for the real world. Okay? It's not just meant for English; it's meant for all languages — or at least, basically, 8 billion people around the world. Which means that you need to address about 60 languages, and you have to achieve extremely high accuracy for 60 languages. And the problem that comes with achieving high accuracy is: how do you recognize a minority language with as high accuracy as much more popular languages, while having tons less labeled data, or almost no labeled data? And what really stuck out to me at Facebook — because we were in so early on speech recognition; we built speech recognition there for, you know, 10 systems, 20 systems, different languages — is the amount of human-labeled transcription data that you need to train a system. And back then it was clear — this was in 2016, 2017 — that the more data you have, the better the speech AI system. And it was clear that if you had, for example, 15,000 hours for English, English was so much better than if you had just 500 hours for another language. So the key question that I stumbled on is: how can you level this, equalize across all languages, to give not just English but basically 60 other languages equal opportunity to use voice and speech in their applications, and still work at a really, really high grade? So the key question is multilinguality, and the accuracy that comes with multilinguality — how to achieve this. And so it was clear to me that this human labeling that's been happening, and still is happening, is not really a way to get to native-speaker accuracy, okay, which is required. If you want to have a voice-driven application that works naturally, it has to understand almost everything. You can't have a word error rate of, I don't know, 10, 20%, where every fifth word or every tenth word is misunderstood — you wouldn't really use such an application. So basically, at the very start, when we started Soniox, the key question was: how do we use unsupervised learning — or so-called self-supervised learning now — to use an insane amount of internet and other available data to pre-train models in a way that they can understand a lot of Danish, Finnish, Arabic languages and be able to recognize them very accurately, despite the fact that there's very little human-labeled data for it? And we really pushed hard on this from the very beginning. This is what gave us an edge.
And that's why today we have, you know, really native-speaker accuracy for 60 languages. You can go to Japan, and you can speak in Japanese, Korean, Taiwanese, English obviously, Spanish, Portuguese, German, Italian. I mean, really, most languages around the world with some significant population are covered by our model, and it recognizes them. Super interesting. And you touched upon there something that I want to dig into a little bit. You mentioned unsupervised learning and labeled data. One of the things that's been a constant theme, I would say, across all these podcasts is that when you talk to leaders in the space, everything comes down to the data — the labeling, how it's connected. So at a high level, how does Soniox approach this, into a single speech model across so many languages? No secret sauce — I'm just saying, at a high level: you mentioned you put together unsupervised learning systems, found ways around this, right? Yeah. So we call it an AI data factory internally at Soniox. We are basically utilizing the most advanced AI models — from speech to LLMs to language identification — to gather and create, with AI, large, high-quality training data sets on which we can then train our system. Think of it like this: what's been happening with supervised fine-tuning and reinforcement learning for LLMs — we've been doing that for three, four years now. Okay? So we started basically with that, and we really pushed this process to very, very large scale. We're operating on petabytes, many petabytes of audio data — many tens of millions of hours. So the question is: how do you use the state-of-the-art AI, in various ways, to create and select great data to train a better AI? That's where the magic circle is, I would say. And how do you do this? Because speech training is much different from text training. Okay, so let's maybe dive in here just for a second. When you train a text LLM, what is the input to the text LLM? It's text. And what is the output? It's also text. And typically, at least in the pre-training stage, these two texts are actually the same — input text and output text are the same; they're just shifted by a token. So the model, based on N tokens, predicts the next token. So the pre-training, and collecting data for pre-training, is in so many ways much, much simpler, because you don't need to link two modalities. It's actually the same text; you just shift it a little bit to the right by one token, or one word, whatever you want to call it. But when you go to speech to text, you have two modalities. You have the audio signal as input — raw audio signal. And the problem with speech to text is you have to actually predict text tokens — words, text. You don't predict speech; you don't predict the same modality. It's a different modality. And if you go online, you will not find some beautifully created datasets for, say — let's pick Finnish — where you have tons of real-world Finnish speech mapped to text. You will not have that. The key is: how do you actually create such datasets, where you can map basically any language to text, and then successfully train on it?
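The shift-by-one-token point is worth seeing concretely. A toy sketch: for text pre-training, the input/target pair falls out of the text itself for free, while for speech the input is a waveform and the target is a transcript — a pairing that has to be manufactured, by humans or by the kind of AI data factory described above (the file name below is a placeholder):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Text LLM pre-training: input and target are the *same* stream, shifted by one.
inputs, targets = tokens[:-1], tokens[1:]
for x, y in zip(inputs, targets):
    print(f"given ...{x!r} -> predict {y!r}")

# Speech-to-text training has no such free lunch: the input is raw audio and
# the target is text, two different modalities that must be explicitly paired.
example = {"input": "finnish_utterance_0001.wav",  # raw waveform (placeholder name)
           "target": "hyvää huomenta"}             # human- or AI-produced transcript
```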
So when we think about the bigger problems and where this space is going — how do you see speech recognition really evolving over the next year or two? This speech-to-speech experience — so speech in, speech out; STT, TTS, and some brain in the middle — is really going to, I think, get to a point where it's going to be very reliable and robust, and you will be able to converse with an AI like with a human. I think this will happen, let's say, in one year's time. Now, how much of the task it will be able to do, that's a separate thing. But at least you will be able to have a very, very human-like interaction — communication for some period of time, let's say, I don't know, five, 10, 15, 20 minutes or so, one session. Not days or weeks; I think there's more to do there. But this will be the key thing. I mean, just making conversing with an AI just like it would be with a human. Really blurring that line between where it's headed and where it is now. Yeah. No, that makes sense. Earlier you mentioned unsupervised learning. Do you see that really continuing to push the envelope forward, along with better data and producing better data in that sense? Yeah. I mean, look, just like we see with text LLMs — you know, reinforcement learning is kind of taking on a life of its own in terms of how you do training and improve the models. The same kind of feedback, AI-data-factory process happens with these multimodal systems where you have speech and text. So yeah, this is going to drive the reliability, robustness, accuracy further, a little bit more every time. So this will be the key — the key for better systems. People often assume that speech recognition, that speech, is kind of this solved problem. What do you have to say to those people? I reckon I've been told speech has been solved pretty much every year since about 2017, 2018. And don't get me wrong, it is solved for more and more use cases. But what was the long tail then becomes sort of the high volume now. People's expectations grow, and the stakes, as they move into higher-risk, higher-stakes industries, become more important. So there is still a long way to go on it. Sure. I mean, take this conversation, right? This conversation is pretty well mic'd up. There's not a lot of background audio; accent-wise, there's nothing that challenging. The areas where we would fail here will probably be if I start mumbling, if I start using some funky vocabulary — sort of things that aren't in the dictionary. The diarization is going to be picked up off each mic separately. You could do a pretty decent job of this. But this isn't the real world — I mean, it is the real world, but it's a small representation of it. There are so many different things that you could add onto this where the problems still remain unsolved. But to be honest, the hard problems are the funnest ones to have, because they show the ability to really differentiate. And of course, even if it's sort of solved in and of itself, much like we're seeing with LLMs, there is still a question around solving things efficiently too. So you've got LLMs that think really well, but if they're costing you an absolute load of tokens, then that's not usable every day, or at least not sustainably. But obviously they're making those systems more efficient to deliver the same quality of output. So there are so many things still to attack.
But I also fully take the point that for a lot of use cases, you can get pretty good with, probably, a model that's sitting on device. That long tail keeps on growing as well. So it sounds like the low-hanging fruit is a bit more solved, but then, as you mentioned, when you get into real-world scenarios — noise, background noise, multiple speakers, mumbling — it changes the game. That's not the solved problem, right? Yeah. And accents and dialects. And then there's this request — people want things to be simple. They want multilingual models. They want models where you don't need to say what language you're speaking in, what the accents of the speaker are. They can just turn it on, press go, and it understands everything. That in and of itself drops the accuracy. If you have a model that runs a hundred languages, it will reduce the accuracy a little bit. And then how do you raise that to the quality that you would have got from a single-language model beforehand? If you want to go use-case specific, you could use a model for a certain use case, or now people are training models with LLMs on the back end where you can sort of access a different section of the overall model itself. Again, there's always going to be something which is driving the innovation and the need for improvement there. What's a voice AI capability that feels close, but not quite there yet — especially for someone like you that's really embedded in this? Speech-to-speech conversations feel great. They are moving closer and closer, but for regulated, enterprise-grade work, they're not there yet. You still want a cascaded approach, in my view. Having that central pillar of text to be able to control things, to guardrail, is really important. And I'd also say the same on the emotion side. It is getting there, but I still feel like I've not had an interaction with a system that really — I mean, we're known for dry sarcasm in England as well, right? I haven't seen something that really captures an understanding of how I'm saying something, effectively. Makes sense. Not only the intent, but the inflection of it all. And it's not quite there. When you say speech to speech, that also gives me some flashbacks, because playing with speech-to-speech models, you realize very quickly that the audio in and out doesn't always match the text out. Right. And in highly regulated industries, you just can't have that level of deviation. Even if it's one word, it brings into question everything. So yeah, I'm going to stand firm with you on the cascading. I believe we're going to continue to see speech in, speech out, but how that happens in the middle... I think speech to speech will come, and I don't think it's far away, but I don't think it's going to make a big impact in '26. The way I see it being used is quite interesting. Speech-to-speech gives you a really natural, nice experience. The cascaded models are moving that way too. Speech-to-speech gives you great low latency. But I see a world in which people are honestly switching out between models. So maybe the beginning of the interaction is a speech-to-speech model — it works out what you're trying to do — and then it moves out to a cascaded model. We've seen that with a few of our partners. But for the real enterprise-grade stuff, speech to speech is not there yet.