Hermes Frangoudis (00:07) Hey everyone. Welcome to the Convo AI World Podcast, where we interview the builders and founders pushing the voice AI space forward. Today, we have a very special episode for you. A masterclass in speech to text. We're speaking to the researchers, developers, and founders revolutionizing the way machines understand audio and translate that into text that can be used by large language models and AI systems. Let's dive in. Hermes Frangoudis (00:32) Well, I'm so glad you brought up pipeline because this is really an opportunity where we can talk about like, let's define the tech stack. Let's break it down. What is the first piece of it, right? Like the speech-to-text, what's happening there? What is ASR and like noise cancellation, voice activity, the VAD. Like let's dive into a little bit of that. Ben Weekes (00:48) Yeah. So the three main components of a voice Conversational voice pipeline are the speech-to-text or sometimes called ASR. They're sort of interchangeable, but they mean slightly different things. And that's where the computer gets to understand what you're saying. And that gets turned into text. The second part is then being able to take that text and come back with an answer. We've all played with ChatGPT online. You know what it's like. You send it some texts. It comes back pretty quickly with a pretty decent answer. You can ask it to keep that. limited to a certain number of words and then those words can get turned back into voice again and spoken back to you. And that's what's known as a Cascading Pipeline. That's one sort of field of this. The other area is what gets called a Real-Time Pipeline. They're both actually Hermes Frangoudis (01:32) So is that the voice-to-voice pipeline? Ben Weekes (01:34) Yeah. OpenAI coined the phrase real-time, even though the cascading pipeline I just described is actually just as fast. 
So it's also real-time, but real-time as coined by OpenAI, and also Google Gemini, takes voice directly into the LLM and sends voice back out again, the idea being that it's even faster. But even though the speeds are similar, it does have other advantages, inasmuch as the LLM can actually hear the emotion in your voice. It can comment on your pronunciation of certain words in a language tuition setup. And it's able to shape the tone of the output voice to match the sentiment of what it's saying a little bit more strongly. Hermes Frangoudis (02:17) Even though it's a bit of a black box system, it has its advantages in taking the audio in first and dealing directly with it, versus the cascading. Ben Weekes (02:24) Yeah, in theory. But even with cascading, when the speech-to-text happens, you can put in, or some speech-to-text engines will put in, metadata to describe how the person spoke it. Like, was it sad? Was it energetic? Those types of things. And also, when you send the text from the LLM output to the text-to-speech engine, you can include speech markup to tell it which bits should be pronounced in certain ways, and again control the emotion in the voice that comes out at the end. So actually, they're pretty similar. Even though no one really knows exactly unless you work at OpenAI or Google Gemini, I think the implementation behind the scenes of their real-time voice-to-voice models is kind of built around a cascaded model anyway. Although interestingly, when you look at the text coming out of OpenAI along with the voice, they often mismatch. So I might hear, "Hi Ben, how are you?" But in the text it's saying, "Dan, how are you?" So they're not perfectly in sync. It's like they're using two different systems to create the effect of a pure multimodal LLM. And by multimodal, we mean capable of handling text, voice and video all at the same time.
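The cascading pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `speech_to_text`, `generate_reply`, and `to_speech_markup` are hypothetical stand-ins that return canned data rather than calling any real provider, and the `emotion` field models the kind of metadata some speech-to-text engines attach. The markup uses the standard SSML `prosody` element.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Transcript:
    text: str
    emotion: str  # hypothetical metadata tag from the STT engine, e.g. "sad"


def speech_to_text(audio: bytes) -> Transcript:
    # Stand-in for a real STT call; returns canned output so the sketch runs.
    return Transcript(text="I failed my driving test again", emotion="sad")


def generate_reply(transcript: Transcript, max_words: int = 12) -> str:
    # Stand-in for an LLM call. A real implementation would send the
    # transcript text and the emotion tag to the model; here the reply is
    # canned and simply capped at max_words.
    canned = "Sorry to hear that. You will get it next time, keep practising."
    return " ".join(canned.split()[:max_words])


def to_speech_markup(reply: str, user_emotion: str) -> str:
    # Wrap the reply in SSML so the TTS engine can adjust its delivery.
    rate = "slow" if user_emotion == "sad" else "medium"
    return f'<speak><prosody rate="{rate}">{escape(reply)}</prosody></speak>'


def cascade(audio: bytes) -> tuple[str, str]:
    # STT -> LLM -> TTS markup. The reply text that is logged is exactly
    # the text that is handed to the TTS stage.
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return reply, to_speech_markup(reply, transcript.emotion)


reply_text, ssml = cascade(b"")
print(reply_text)
print(ssml)
```

Because every stage hands plain text to the next, the logged reply and the spoken reply cannot drift apart, which is the transparency point made about cascading pipelines.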
Hermes Frangoudis (03:34) So the text is more of like an approximation as to what it said versus like a word-for-word transcript. Ben Weekes (03:37) Yeah, they're basically applying speech-to-text on what goes in and out of the LLM, separately from what the voice LLM is actually doing. And they're not perfectly in sync, which is a bit misleading at times. At least with a cascading pipeline, you know that what you see in the text is 100% going to match what was said, what it thinks you said, and what it's trying to say back. Hermes Frangoudis (03:46) Very interesting. Yeah, I think that transparency is pretty key though, right? Because the black box method has some of those limitations. You don't get the same access to that information for auditing purposes and setup in that sense. But you also have kind of like limited control, right? You have to use whatever function tools they have put in place there and that sort of thing, versus cascading. I feel like you could put your own. Ben Weekes (04:21) Yeah, so with cascading, you can mix and match your providers. My hottest pipeline right now for the language learning use case is to use ElevenLabs for speech-to-text, even though they're known for text-to-speech, but their new speech-to-text is really good for multilingual. Then into Groq for your LLM, probably the Llama 70-billion-parameter model, which is super fast. You know, it's on specialist hardware, so it replies really quickly. And then into OpenAI's text-to-speech, which is able to mix languages and has really good emotion built in, different accents, dialects. It's really quite good at telling jokes, it can whisper, it can do this type of thing. Hermes Frangoudis (04:57) That's a pretty interesting pipeline, right? Like completely from left field in terms of what each of those providers is really known for, right? You would think OpenAI would be maybe more the LLM of choice and... Ben Weekes (04:59) Yeah, the other one.
Hermes Frangoudis (05:10) People often assume that speech recognition is kind of like this solved problem. What do you have to say to those people? Ricardo Herreros Symons (05:13) I reckon I've been told speech has been solved pretty much every year since about 2017, 2018. And don't get me wrong, it is solved for more and more use cases, but what is the long tail then becomes sort of the high volume. People's expectations grow, and the stakes become more important as they move into higher-risk industries. So there is still a long way to go on it. Sure, I mean, this conversation, right, this conversation is pretty well miked up. There's not a lot of background audio, and accent-wise, there's nothing that challenging. The areas where we will fail here will probably be if I start mumbling, or if I start using some funky vocabulary, so sort of things that aren't in the dictionary. The diarization is going to be picked up off each mic separately. You could do a pretty decent job of this, but this isn't the real world. I mean, this is the real world, but it's a small representation of it, and there are so many different things that you could add onto this where the problem still remains unsolved. To be honest, the hard problems are the funnest ones to have because they show the ability to really be able to differentiate. And of course, even if it's sort of solved in and of itself, much like we're seeing with LLMs, there is still a question around solving things efficiently too. So you've got LLMs that think really well, but if they're costing you an absolute load of tokens, then that's not usable every day or sustainably, or at least at some point it isn't, but obviously they're making those systems more efficient to deliver the same quality of output. There are so many things still to attack.
But I also fully take the point that for, for a lot of use cases, you can get pretty good with a probably model that's sitting on device. That long tail keeps on growing as well. Hermes Frangoudis (07:33) So it sounds like the low hanging fruit is a bit more solved, but then as you mentioned, when you get into real world scenarios, noise, background noise, multiple speakers, mumbling, it changes the game. That's not the solved problem, Ricardo Herreros Symons (07:56) Yeah. And accents and dialects and then sort of like this request, like, ⁓ people want things to be simple. They want multilingual models. They want models where you don't need to say what language you're speaking in, what the accents are of the speaker. They can just turn it on, press go. And it understands everything that in and of itself, that drops the accuracy. If you have a model that runs a hundred languages, it will reduce the accuracy a little bit. So then how do you raise that to sort of the quality that you would have got from a a single language model beforehand. If you want to sort of go like use case specific, you could use a model for a certain use case or now people are training models with LLMs on the backend where you can sort of access like a different section of the overall model itself. Again, there's, always going to be something which is driving the innovation and the need, the need for improvement. Hermes Frangoudis (07:57) So I love that we're talking about data because data is really at the core of deep learning and that high-quality data is kind of like the grail of deep learning. And like you said, the better the quality of the data, the smaller the model, the more accurate it can be. So can you walk us through a little bit of that model training process, like the data sources, pre-processing, that sort of like architecture? Andrew Seagraves (08:24) Yeah. I'll start with like a big picture. How do you train like a state-of-the-art speech recognition model today? 
I think like the simplest way to do it is that it's a two-stage process. And this is typical across many deep learning models. You have like a pre-training stage and a post-training stage. So there's parallels to this in LLM training, the way the LLMs are produced. Kind of the same thing in speech recognition. So in the first stage, you're trying to train with a very large scale of data. As much data as you can get your hands on, covering as many voices as you can get, as many acoustic conditions as you can get, and then as many examples of words being spoken. So this is something that is maybe like underappreciated, that you want the model to have a very broad exposure to speakers and audio conditions. But when you do that, you also increase the frequency of particular words relative to others. So there's this interesting scale effect that when you scale up the data, the frequency of stop words, like most frequent, like ANDs and THEs, it explodes. And then, you have an emergence of a very long tail of rare words that appear. So basically, the best you can do is just keep scaling the data as much as possible, in the first stage, and you train a model on that. So in that case, you're training primarily on data that's crawled from the web. And then you're filtering that data to the best of your ability to isolate audio that has human transcripts, where the human transcripts are good, basically. So that's the name of the game for the first stage. And you get a model that is pretty good, I would say. And that is how, for example, Whisper was produced. Hermes Frangoudis (10:05) Okay. Andrew Seagraves (10:05) Whisper is like the first stage of a production grade speech-to-text training. In the second stage, you specialize the model in like in post-training and you train it on a much, much smaller, more narrowly distributed corpus that covers just the domains that you, that you largely, that you care about. Yeah. 
You focus the training, and in that case the data is very high quality. It must be basically gold ground truth, labeled by humans who are following a very prescriptive style guide so that the labels are consistent. In the first stage, you have labels that have been generated by millions of different humans with no consistency in style. And so then you have to unlearn that in the second stage, and the model's output becomes consistent. Hermes Frangoudis (10:48) Forget everything I taught you. Andrew Seagraves (10:51) Yeah. And so that's basically how it goes. The magic really happens in the second stage, although the first stage is also important. So one thing I'll say, and this is something we've observed that we haven't published, I would say, is that as you scale the corpus in the first stage, there will be a set of words that you've seen, let's say, 10,000 times or 100,000 times. And for those words, you sort of saturate the model's ability to predict them, in that if you were to show the model more examples of those words, it wouldn't help. And then you have this long tail of words that you've seen fewer times than that threshold of, let's say, 10,000 to 100,000 times. And word error rate depends just directly on how many times you have seen those words. Yep. Hermes Frangoudis (11:32) It wouldn't help. It's pretty crazy. So like, Whisper, essentially, the thing that everyone claims to kind of be building off of, is just part one of the puzzle. So if you're not applying part two, you're not gonna get the performance that you should be getting out of this. Andrew Seagraves (12:00) Yep, that's right. And you get all kinds of what we call model pathologies that result from part one. The model will insert words that aren't there.
The model will omit words that are actually present in the audio, what we call shyness. Hermes Frangoudis (12:14) I've seen this, and I've had people tell me, "No, that doesn't happen. It never happens." And it's like, but it totally happens. You get one thing in the audio and another thing in the transcript, and you're like, they don't match. Andrew Seagraves (12:19) It totally happens. Yeah, and customers react to those two failure modes differently. When the model's inserting words that are not there, it can be very creative, and so you sort of never know what it's going to say. And so that leads to a lot of failure modes, depending on what the model's choosing to say. But then the silence is just universally despised by all customers. Hermes Frangoudis (12:36) Oh boy. Andrew Seagraves (12:45) The model should produce something when words are happening. And that's a big one. Yeah. Hermes Frangoudis (12:50) That's super interesting to hear. I feel like sometimes missing a word is also a little bit more forgiving because you're like, all right, it just kind of missed the word. It's a lot more jarring when there are words added to it. Andrew Seagraves (13:04) Yeah, that is more jarring, especially if you're seeing it in real time. It is kind of interesting that real-time speech recognition is just more challenging than batch, because you're operating on small chunks of audio at a time, in principle, when you really shrink down the buffer of audio that you're sending. And so that's just harder. You have less context. You don't get to see what is coming in the future. In batch, you can see the whole file at once. Hermes Frangoudis (13:28) So are you constantly building that? As it goes, you're building on the previous buffer that you got? Andrew Seagraves (13:35) Yeah, and there's a tremendous range of different things you can do there from a modeling perspective.
But largely speaking, you're maintaining some state about what you've seen so far and updating it as new audio comes in. And then the model may be doing something like deciding whether or not it's going to emit a prediction at this frame or not. And so real-time opens up all kinds of interesting mechanics. But model pathologies are also more prevalent. And then also, real time is the setting where people are watching the transcription live and they see them. So, yeah, for all these reasons, real time is way harder. Hermes Frangoudis (14:09) You don't have to tell us we've been in that, that space. Yeah, we're well aware. And it's not just like talking to an LLM, talking to a person and real-time voice and video streaming. That's like another one of those things where someone's talking, you expect it to happen. And think that's like one of those like fine tuning green things in our human pathology of like, we're expecting this to go along with that. Hermes Frangoudis (14:10) Yeah, no, it's really exciting where we're seeing the trends and stuff like that. And you brought up a good point. Agora is upskilling our network essentially. Like we focus traditionally on human to human interaction, but now there's a lot of opportunity between human and computer. So as you said, voice and video is the new interface and power the best voice and video infrastructure in the world. I also want to hear about what you're building with voice and video because I know you're building some really cool stuff on bringing this all together. You're using Agora, but can you maybe walk us through some of the awesome pieces of software and systems you're putting together? Ram Katamaraja (14:46) So I like Agora for its ability, its enterprise-grade ability to stream billions of minutes. So it's an enterprise-grade infrastructure that has been tried and tested. And then when it's related to the conversational AI, there are many tools out there. Some of the notable ones are related. 
ElevenLabs is a pioneer. And OpenAI, obviously, you have it. Microsoft Azure voice libraries are there, and there are many other conversational AI platforms that are coming up, which is super exciting in this space. However, what I like about Agora is that it's a tried and tested enterprise-grade framework on top of it. And it is being built in a way that we can integrate with various systems. I mean, if you don't mind, I can quickly share something. Hermes Frangoudis (15:34) It's built like Lego blocks, gives you all the control while taking away all the headaches, right? Ram Katamaraja (15:40) Yes. Do you see my screen? Okay. All right. So what you're seeing here is the traditional Agora framework, right? So you have the Agora platform, which has SIP and WebRTC conversations, and it has the ability to integrate with enterprise systems, CRM systems, call records, like everything. And it can do all kinds of really cool stuff related to text, voice and video streaming, right? On top of it, the conversational AI is being built, and this conversational AI is, like you said, a Lego block which allows you to integrate with any other systems. So for example, this is a stack that I'm just showing. At Colaberry, we have the AI orchestration platform that has a way you can fine-tune LLMs, you can fine-tune text-to-speech, translation services, MCP connectors, and all kinds of orchestrations and everything. So these architectures could be used to enable basically Convo AI. So let me just show you this particular flow diagram a little bit. So here, as you can see, this is an app where real-time audio streaming can happen. And here in the backend, you can integrate with agentic flows. So for example, here you can integrate with ElevenLabs text-to-speech, right? You can integrate with OpenAI Whisper modules, or you can integrate with Microsoft modules, right?
And in the backend for an enterprise, you have a lot of custom intelligence inside the company, right? So for that, you may need large language models, and you have custom databases. You can build RAG tools and everything, right? So the Agora platform, by being the front end and having the ability to plug in any of these backend systems, allows you to build an enterprise-grade text-to-speech, speech-to-text conversational AI platform. So, any questions here? Hermes Frangoudis (17:46) In here, zoom out maybe a little bit, the person calls in. So this isn't like your traditional app. This is a dial-in number. So they're interacting with a traditional interface in terms of the mode of connection, right? Ram Katamaraja (18:00) Yeah. So the beauty of the Agora platform is it can work with both phone as well as VoIP, right? Regular telephone and VoIP. So you have a SIP switch, right? And with the SIP switch, you can basically stream the voice across the regular telephone line. But in this particular one, you could just have a client, an Android client or a VoIP client. So it allows both of those, right? And this whole Agora infrastructure is already there and it works at the enterprise grade. So all we are doing is making it AI-enabled, because previously these were not necessarily smart conversations. Now, by integrating with AI platforms, they become smarter conversations. Hermes Frangoudis (18:48) Very interesting. You're using Agora as kind of like the underlying level of the streaming orchestration and then tying it into the pieces that you've been building over the years, that your enterprise customers have been building over these years. The knowledge base, their MCP servers, all these things that kind of can exist without voice, but now being able to bring it to this new interface, sort of. Ram Katamaraja (19:12) That's correct.
all these interfaces already exist in the internet 1.0 and 2.0 world. Like now, how do you upgrade to the AI world where it's smarter? So that's where the enterprise may already be using a good platform for their call centers ⁓ and all kinds of streaming needs that they have. And the calls may be already getting recorded, et cetera. Now we can. Uh, we, we, uh, we can integrate them with the AI infrastructure and make the whole, uh, business smarter. And maybe they can even innovate like new products and services on top of it. So for example, they could have, uh, uh, they could have automated, I don't know, like if you take a healthcare industry, confirming appointments is a big thing. Following up with the people is a big thing, like prescription medications. All these are like usually instead of like humans following up or even robotic. calls being made, like these can be like lot more humanistic empathetic calls because the AI interacts. It's just not delivering information. Yeah. It has the ability to interact so you can upgrade and make your, make the processes like lot more. Hermes Frangoudis (20:21) away from that traditional phone tree. Hi, we're calling to remind you about your insert, whatever press one. Right. So this is like having a natural conversation, which for most people is, all they really want to be able to do when they call in. Like, I don't think they really care that it's a person or not. It's I don't want to have to like change the way I interact and speak to accommodate their phone tree or their technology. that just makes my life more difficult, right? So this is an ability to really bring that human interaction back into things. Hermes Frangoudis (20:22) Yep, that's another one. Like how much do I relate to this voice, connect with it, feel that it doesn't feel cold, I guess. I don't know what the description would be there, but that's an interesting point. Sorry, you had me thinking there. 
You mentioned in the relatability, the ability to pronounce certain words. And I noticed like different voice models will pronounce or mispronounce the same word. Like, even though it would be a word that I would think they should all pronounce the same, it's funny to watch some stutter on it or completely mispronounce it versus others that hit it on the head. Lily Clifford (21:02) I think like, you know, if you throw like a list of extremely challenging brand names to any text-to-speech model, you're going to find that some of them pronounce them right? And some of them pronounce them wrong. And that set is different for each model. At the same time, we're often selling to teams of enterprise developers. We saw the enterprise teams of developers are building really high-volume calling applications for, in many cases, Fortune 100 businesses. And you can imagine, right? It's one thing to mispronounce a word. It's another thing to not have predictability over whether a word is going to be pronounced correctly or not. And so like a lot of people build models. I would describe them as like general-purpose models, speech models where like they're really good at reading things out. Like, but what you lose from having this, this like less rich, what you lose from having less richly annotated data on the phonetic level, right? Like I'm talking like having a phonetic transcription system for the data, and that's what you train the text-to-speech model on, not just how the words are spelled, how you and I would spell them. Then what you lose is predictability and control. And what I mean by that is if you send Häagen-Dazs to our API, maybe, I'm not sure, maybe we pronounced it incorrectly today. But we want to build models where we can tell you A, that we are less confident that we're going to pronounce it correctly. And B, if we're not pronouncing correctly, you can fix it immediately without us having to retrain the model. 
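The two properties described above, telling the developer which words are low-confidence and letting them fix a pronunciation without retraining, can be sketched as a small pre-processing step. This is a minimal sketch under assumptions, not Rime's API: `KNOWN_WORDS`, `CUSTOM_IPA`, and `plan_pronunciation` are hypothetical names, the vocabulary is toy data, and the IPA string is illustrative.

```python
import re
import unicodedata


def norm(word: str) -> str:
    # NFC-normalize and lowercase so "Häagen-Dazs" matches however "ä" was typed.
    return unicodedata.normalize("NFC", word).lower()


# Hypothetical data for illustration only.
KNOWN_WORDS = {norm(w) for w in ["try", "the", "new", "flavor", "a", "history", "of"]}
CUSTOM_IPA = {norm("Häagen-Dazs"): "ˈhɑːɡəndɑːs"}  # illustrative, customer-supplied


def plan_pronunciation(text: str):
    """Return (plan, uncertain): per-word phonetic overrides plus low-confidence words."""
    tokens = re.findall(r"[\w'-]+", unicodedata.normalize("NFC", text))
    plan, uncertain = [], []
    for token in tokens:
        key = norm(token)
        if key in CUSTOM_IPA:
            plan.append((token, CUSTOM_IPA[key]))  # explicit override, no retraining
        else:
            plan.append((token, None))             # leave it to the model
            if key not in KNOWN_WORDS:
                uncertain.append(token)            # flag for human review
    return plan, uncertain


plan, uncertain = plan_pronunciation("a history of cystic fibrosis")
print(uncertain)
```

In a live voice agent, the `uncertain` list is the part that replaces pre-generated QA: the text is not known in advance, so the system has to report which words it is unsure about instead.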
Hermes Frangoudis (22:28) So being able to pass that like the pronunciation annotation kind of with it if it's not, if it doesn't feel confident enough or yeah. Lily Clifford (22:36) And by the way, yes, exactly that. And at the same time, like this is a feature that's existed in text-to-speech models since the very beginning of text-to-speech models. And at the same time, no one else has built like workflows around telling you, right? Like here are the words we're not pronouncing correctly today. Otherwise, you just have to like call yourself and guess. And by the way, this wasn't a problem like pre LLM, because if you're talking about this IVR maze, right? Like everything's basically like, pre-generated. You can run QA. Once you run QA, that's it. You're like, everything sounds good. If we're not pronouncing Häagen-Dazs correctly in the phone tree, you pass the international phonetic alphabet, you use the text-to-speech model to create audio, done. But like now in the era of LLMs, you don't even know what the voice agent is saying before it says it. Hermes Frangoudis (23:18) It's completely unpredictable. So the ability to pre-QA it has gone out the door, right? So it's more about like how do you pronounce it predictably? How do you ensure? That there is confidence or no confidence. Lily Clifford (23:22) Yes. Correct. And really this like the Rime thesis is like, if you don't have a path to a 100% accuracy, then enterprise won't adopt. I'm not saying you have to have a 100% accuracy because we're never going to have a 100% accuracy. But if you don't show teams and developers who are building voice agents a path to a 100% accuracy, they can't build a product really. Yeah. They can't build. Right. And so like, if you're, you know, Providence Medical and you're building a voice agent for doing like, Hermes Frangoudis (23:50) they're not gonna feel comfortable. Yeah. 
Lily Clifford (24:00) genetic counseling screening, you have no idea what the patient on the other end of the call is going to say, right? Maybe they have a family history of cystic fibrosis. You never thought about cystic fibrosis before in your life. And then the voice agent mispronounces cystic fibrosis and you're like, this is the opposite of an empathetic clinical experience. Do you know what I mean? Hermes Frangoudis (24:01) ...so precise than that. Ziyi Lin (24:02) Yeah, VAD essentially acts as the traffic controller here by detecting both the start of sentence, SOS, and the end of sentence, EOS, as I mentioned before. It is what triggers the entire chain in real time. So for example, once VAD picks up the end of sentence, say after 200 milliseconds of silence, a threshold we can adjust based on our needs, it signals the STT system, the speech-to-text system, to finalize the transcription. This transcription is then sent to the LLM, and the LLM's output then triggers TTS to synthesize a response. So this is why low VAD latency is very important in conversational AI applications. Hermes Frangoudis (24:57) Yeah, you said it best. I've never heard it said that way, but I really like how you put it. It's the traffic controller. It's like, "Let's go, let's go. You're coming in, you're coming in. We're done. We're done." That's it. Time to process, right? End of sentence and end of speech. Let's get it moving. But another critical point of that traffic controller is also to be like, "Oh wait, stop this whole process. You know, we have..." Ziyi Lin (25:06) Traffic controller. Yep. Hermes Frangoudis (25:27) "...we have an interruption." So VAD has to, you know, detect the start of sentence, right? Ziyi Lin (25:34) Yeah, the start of sentence affects the interaction latency, I think. And the EOS affects the whole end-to-end, or response, latency. So it's very important. Hermes Frangoudis (25:52) Yeah, it's such a critical piece.
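The traffic-controller role described here can be sketched as a small state machine over per-frame speech decisions. This is only a sketch under assumptions: production VADs are typically small neural models that emit a speech probability per frame, whereas this example takes ready-made booleans. The 20 ms frame size and the name `vad_events` are hypothetical, and the 200 ms end-of-sentence threshold is the adjustable value mentioned in the conversation.

```python
from typing import Iterable, Iterator, Tuple

FRAME_MS = 20  # assumed frame length


def vad_events(
    frame_is_speech: Iterable[bool],
    eos_silence_ms: int = 200,
    frame_ms: int = FRAME_MS,
) -> Iterator[Tuple[str, int]]:
    """Yield ("SOS", t_ms) when speech starts and ("EOS", t_ms) once
    eos_silence_ms of continuous silence has followed speech."""
    in_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frame_is_speech):
        t_ms = (i + 1) * frame_ms  # timestamp at the end of this frame
        if is_speech:
            silence_ms = 0
            if not in_speech:
                in_speech = True
                yield ("SOS", t_ms)  # also the signal to interrupt any playing TTS
        elif in_speech:
            silence_ms += frame_ms
            if silence_ms >= eos_silence_ms:
                in_speech = False
                silence_ms = 0
                yield ("EOS", t_ms)  # tell STT to finalize and hand off to the LLM


# 100 ms silence, 200 ms speech, 200 ms silence, then the user speaks again.
frames = [False] * 5 + [True] * 10 + [False] * 10 + [True] * 3
print(list(vad_events(frames)))
```

Lowering `eos_silence_ms` makes the agent respond sooner but risks cutting the user off mid-pause. The second `SOS` in the example is the case where new speech arrives after the turn has ended, which is when an agent that is already replying would need to stop.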
And then... When you see TEN being run, right? And this VAD, is it more edge or full cloud? Like how important is it where you run it, depending on the use case? Ziyi Lin (26:11) The VAD model is very small, only a few hundred kilobytes. Typically, it can be run on edge devices to ensure low-latency detection of voice activity. And users hate delays when interacting with agents. Running VAD on the edge just cuts down the lag because it is right there on the device, detecting speech starts and stops immediately. And another thing is about the bandwidth and the cost savings. If the VAD is on the edge, it only sends audio frames that actually have speech to the STT, the speech-to-text or ASR system, whether that's on the edge too or in the cloud. And it can help the users save on bandwidth costs and STT costs, and keep things efficient. Hermes Frangoudis (27:20) Yeah, that's huge. The STT cost is something you don't want to mess around with, right? Because you don't... Ziyi Lin (27:26) Because you don't want to just send the noise or the non-speech segments to the STT system, because it is not meaningful. Hermes Frangoudis (27:27) I want to go back to something you kind of touched upon, bringing in all this multimodal data, right? And then being able to get the model to understand how it brings all these kinds of markers together, but then uses voice as the actual input of the biomarker to connect the dots for this new individual. Can you tell us how important that is and what it can really tell us about our health? Emilia Molimpakis (27:48) Yeah, so voice, it turns out, with those different modalities that I spoke about, content, acoustics and so on, is extremely powerful.
I mean, I knew this from my research, but I don't think we ever anticipated just how powerful it could become with the power of AI and the right dataset and so on. So voice is super powerful, so very strong. Like you can see multiple facets of somebody's state of body, state of mind. But the other big asset of voice is it is everywhere. It is ubiquitous. Everybody in the world, or rather, not everybody, but the majority of people in the world will use voice to communicate. So it's very natural. It's very comfortable. It's an obvious way of actually communicating. And that also means it's very easy to access. People naturally speak. They don't naturally necessarily turn a video on or anything but talking on the phone, et cetera, people are very comfortable with it. And it's not as privacy-invading as a video. So it's that combination of ubiquity, ease of access, ease of collection, the strength within the signal itself that meant we moved from being voice-video behavior to just being voice. The one problem we have had with voice and everybody has it in voice AI is it is so strong. It carries so much signal, not just about health, but about everything that or about everything around who you are as a human. So your age, your birth sex, your accent, your background, ethnicity, how many languages did you speak until the age of nine that affect how you sound now, where you live. So that is a very strong signal. And you need to differentiate that from what is the actual signal, say for depression, fatigue, stress, diabetes, differentiate out and isolate the signature that is a condition and that's kind of what we specialize in doing. Hermes Frangoudis (29:31) Very interesting and it comes multilingual. Emilia Molimpakis (29:35) Yes. So because we started multimodal and we have so many ways of covering the gaps, our models do work across different languages. 
We typically need to fine-tune a little bit for a new language, but we've done it now across Greek, English, Spanish, Italian, Brazilian Portuguese, Indonesian and Japanese as well. So it's expanding now to all different languages. Hermes Frangoudis (29:58) And those are very, like when you think about the style of language of each of those, very Emilia Molimpakis (30:02) Yeah, they are. They are. Yeah. Hermes Frangoudis (30:05) intonation to tone to just style of speaking, right? Like some people speak really fast, just naturally. Some people are real slow, loud. Some people just always sound angry when that's just like culturally how everyone sounds. Emilia Molimpakis (30:14) Yes. Yeah. My co-founder Stefano always sounds angry. He's Italian and he's loud. He's angry. There's a lot of gesturing. And we can still tell what is happening. One key thing we do, which is also very important, is yes, we kind of adapt across different cultures, but when we are working with users or individuals, initially we compare them to other people in the dataset who match them on potential confounds like age, gender, and so on. Hermes Frangoudis (30:15) What are the most common misconceptions people might have about speech recognition? Andrew Seagraves (30:37) I would say the biggest one is that it is a solved problem. I think even Jensen has said that in like his most recent GTC keynote, he said speech recognition is a solved problem. And I think that it is definitely not a solved problem. It's only "solved" in some very narrow situations. So like it works well in situations where we have a ton of data. So there's particular use cases like call center audio, for example, in English. We have a lot of data. We can train models at scale. And this paradigm of a large, expressive deep learning model trained on lots of data, it works well.
And then we've, over time, collected enough data across a lot of niche domains, or what would have been considered more niche domains a few years ago, in English. And so the models have gotten really strong in English. But I think that in non-English, languages, the models are in general still pretty terrible across the board. And it's just because of the lack of data. And then I would say beyond just having data to cover the very broad range of speakers and acoustic conditions that you're trying to model, the other big challenge that is not solved is being able to recognize rare and localized words, like words that are specific, say, to a particular customer or a particular person. Like how their name is spelled, for example. Words will continue to be a challenge moving forward, like getting the words right. It's like one of the core challenges that's sort of underappreciated, I would say, in speech to text. Hermes Frangoudis (31:48) I feel like it leaves a lot of room to grow, right? Like there's still very early days on this sort of thing. That challenge is kind of what you guys have solved on the English side of it. And I'm sure you're going after international and like colloquial, like localized dialects and stuff like that. terms of how you Andrew Seagraves (32:06) Mm-hmm. Yep. Hermes Frangoudis (32:07) And then going on to Meta and Facebook and getting to deploy to millions of people in production. So you really understood what it takes to scale this kind of technology, not only from research but into production. What problem do you think was the problem that really stuck out to you in terms of speech recognition or translation that you didn't feel was really being solved at the time? What gave Soniox that early opportunity to come together? Klemen Simonic (32:31) Yeah, I mean, the key thing about speech-to-text speech recognition is accuracy. And it sounds simple, but it's actually quite complicated. 
And when I say accuracy, I don't mean accuracy just for English, okay? And just for like clean speech environments where you have really no noise and super high quality audio. When I say accuracy, I mean speech AI that's built for the real world. And that it's not just meant for English, but it's meant for all languages, or at least basically eight billion people around the world. So that means you need to address about 60 languages, and you have to achieve extremely high accuracy for 60 languages. And the problem that comes with achieving high accuracy is like how do you basically recognize a minority language with very high accuracy, as high as for some much more popular languages, while having far less labeled data or almost no labeled data. And what really stood out to me at Facebook, because we were so early on in speech recognition, we built speech recognition there for like, you know, 10 systems, 20 systems, different languages, is the amount of, you know, basically human-labeled transcription data that you need to train a system. And back then it was clear that the more data you have, and this was the year 2016, 2017, the more data you have, the better the speech AI system was. And it was clear if you had, for example, 15,000 hours for English, English was so much better than if you had just 500 hours for another language. So the key question that we stumbled on is how can you level this, equalize across all languages? So to give not just English, but basically 60 other languages equal opportunity to use voice, speech in their applications and still work with really, really high accuracy. So really the key question is multilinguality, accuracy that comes with multilinguality. How to solve this problem? So it was clear to me that this human labeling that's been happening and still is happening is not really a way to get to like this native-speaker accuracy, which is required.
So if you want to have a voice-driven application that works naturally, you have to understand almost every word. So you can't have a word error rate of, I don't know, 10%, 20%, where every tenth word or every fifth word is misunderstood. You wouldn't really use such an application. So basically, at the very start, when we started with Soniox, the key question was how do we use unsupervised learning, or so-called self-supervised learning now, to use an insane amount of internet and other available data to pre-train models in a way that they can understand a lot of languages like Danish, Finnish, Arabic and be able to recognize them very accurately despite the fact that there is very little human-labeled data for them. And we really pushed hard on this from the very beginning. This is what gave us an edge. And that's why today we have really native-speaker accuracy for 60 languages. You can go to Japan. You can speak in Japanese, in Korean, Taiwanese, English, obviously, Spanish there, Portuguese, German, Italian. I mean, really, most languages around the world with some significant population are covered by our model, and it recognizes them with Hermes Frangoudis (36:00) Super interesting. And you touched upon there something that I want to dig into a little bit. You mentioned unsupervised learning and labeled data. Like one of the things that's been a constant theme, I would say, across all these podcasts is when you talk to leaders in the space, everything comes down to the data, the labeling, how it's connected. So at a high level, like how does Soniox approach this into like a single speech model across so many languages? Ricardo Herreros Symons (36:01) Speech to speech conversations feel great. They are moving closer and closer, but for, like, regulated enterprise grade, not there yet. You still want a cascaded approach, in my view. Having that central pillar of text to be able to control things, to guardrail, is really important.
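For reference on the word error rate figures Klemen Simonic mentions above: WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The Python below is a minimal sketch of that standard definition, not any vendor's scoring code. The function name and the example sentences are made up for illustration, and real scoring pipelines usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ten-word reference with a name and a company in it.
ref = "please send the invoice to klemen at soniox by friday"
one_error = "please send the invoice to clement at soniox by friday"
two_errors = "please send the invoice to clement at sonics by friday"

print(word_error_rate(ref, one_error))   # 0.1, one word in ten is wrong
print(word_error_rate(ref, two_errors))  # 0.2, one word in five is wrong
```

Note that in this made-up example the errors fall on the name and the company, which are often exactly the words an application needs most.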
And I'd also say, on the emotion side as well, it is getting there, but I've not had an interaction with a system that is able to really, I mean, we're known for dry sarcasm in England as well. I haven't seen something that really captures an understanding of how I'm saying something effectively. Hermes Frangoudis (37:02) Makes sense. Not only the intent, but the inflection of it all. And it's not quite there. When you say speech to speech, that also gives me some flashbacks because playing with speech to speech models, you realize very quickly that the audio in and out doesn't always match the text out. Right. And in highly regulated industries, you just can't have that level of deviation. Even if it's one word, it brings everything into question. So, yeah, I'm going to stand firm with you on the cascading. I believe that's really like, we're going to continue to see speech in, speech out, but how that happens in the middle. Ricardo Herreros Symons (38:03) I think speech to speech will come. I don't think it's far away, but I don't think it's going to make a big impact in '26. The way I see it being used is quite interesting. Speech to speech gives you a really natural, nice experience. The cascaded models are moving that way too. Speech to speech gives you great low latency, but I see a world in which people are honestly switching out between models. Maybe the beginning of the interaction is a speech to speech model. It works out what you're trying to do. And then it moves out to a cascaded model. We've seen that with a few of our partners, but for the real enterprise-grade stuff, speech to speech is not there yet. Hermes Frangoudis (38:04) take the approach for speech-to-text. Really, how does Deepgram like balance this real-time latency with the accuracy of the transcription? Andrew Seagraves (38:21) That's a great question.
So there's two major use cases for speech to text. And in your question, you're sort of honing in on real time, so where you want the model to be actually transcribing the words as soon as possible after they're uttered by the person. And that is just a very hard challenge. The other major use case is batch or asynchronous transcription. And that's where you have like, you know, your 10,000 calls for the day that were recorded at your call center. And you want to be able to transcribe all of them in as, maybe like 30 minutes or an hour or more, like as quickly as possible. So that you can get the insights about what happened today. And so I would say like, you know, Deepgram for both of those kinds of problems, we have always worked in a constrained design space where we are trying to achieve the maximum accuracy subject to like, it must be fast and scalable, some like very concrete performance or engineering requirements. And when you operate in this constrained design space, like there are many approaches that do not work well, that would otherwise look great, like in papers. You know, models that don't scale. Also like very, very large models become impractical, if you're trying to hit a latency. And so it's pretty simple. Small models, constrained design space, and then actually imposing those requirements from the beginning so that you've designed with scale and speed in mind. Hermes Frangoudis (40:26) So really having that kind of from the start, not taking things that will, yeah, they work in batch because batch doesn't matter, timing. Andrew Seagraves (40:37) I I would say, so I'll say like one more, go like one level deeper there, double click, if you will. The data actually impacts being able to solve that problem. So the more high-quality data that you have and the more localized it is for whatever particular domain you're trying to model, the smaller you can make the model for a given level of accuracy. 
And so there's like a joint dependence there, if you want to think about it. The better the data you have, the more efficient a model you can use. And so we can leverage the data advantages that we have. So we have collected and labeled a lot of high-quality data to actually make the model smaller. Hermes Frangoudis (40:38) in one direction. That's awesome. Earlier, we were talking about like the flavors and there's only one flavor, right? The Wonder Bread. How does Rime approach modeling for like underrepresented maybe like dialects or speech patterns, like things that are not that common, but they exist enough that when you're an enterprise, like you need to account for these edge cases or these things? Lily Clifford (41:00) And you would be shocked at like how, like we describe it as a long tail, right? But you would be shocked at how tail-like it is, such that you have, right, one of the most prominent agent builders for customer experience in 2025 saying they can't get high quality Castilian Spanish voices, like Spain Spanish voices. Hermes Frangoudis (41:24) Yeah. Lily Clifford (41:25) Like literally almost impossible. Like, yeah, Latin American Spanish is there, right? But like, and why is that? And they have no idea why. Like they tried this model, they tried that model, they put it in front of like, you know, a large enterprise in Spain. They're like, this sounds like someone from Colombia. You know what I mean? And so, like, it's such a moving target too, because like languages change all the time. Hermes Frangoudis (41:28) And you guys got it. Lily Clifford (41:46) And like the colloquial variety of a language changes all the time. And as these AI agents become more capable, the expectation that they remain colloquial, right, and sound like people talk today, will be ever increasing. And so you have to collect that data. I don't know.
And you have to like train models that are purpose-built for fidelity to how people expect people to talk. And at the same time, I mean, by the way, there exist Castilian Spanish models from four or five years ago, people just don't like them anymore, right? Like taste is ever changing. And with that said, like there is essentially no high quality Hindi text-to-speech model in existence today. There is essentially no high quality like Arabic text-to-speech model in production. Why is there no high quality Arabic text-to-speech model that exists? Because people built Arabic text-to-speech models for what's called Modern Standard Arabic, which is a form of Arabic that no one speaks natively. It's like what every audio book would be narrated in. It would be as if we were reading Shakespearean English all the time. And that was the only thing that existed in datasets of English. And then we're talking right now and we're obviously not talking like Shakespeare. And so like, Hermes Frangoudis (42:38) Interesting. I mean, we're not even talking about like traditional English. We're talking like American dialect of English, right? Lily Clifford (42:56) No, yeah, so like the first-order problem is just like having, you know, Saudi Arabic, like generic Saudi Arabic. Hermes Frangoudis (43:04) And then moving down into the different regional. Lily Clifford (43:05) And then moving down, right? Cause like we've proven that, if you talk to a fast-casual restaurant voice agent in Atlanta and you hear an African-American English voice, the likelihood of you completing that order increases. Hermes Frangoudis (43:18) Just because of the relatability. Lily Clifford (43:20) The relatability, exactly. So like, is there anything like that happening in Saudi Arabia? No, not at all, because they don't even have Saudi Arabic models to begin with. Hermes Frangoudis (43:27) Wonder if they're doing it in Greece and Cyprus where my family's from.
Those are such different dialects of Greek that it's funny when you see videos on Instagram how it's pronounced in one versus the other and it's just like, man, you don't even realize how different it is. Lily Clifford (43:41) I know, yeah. No, people don't realize that someone in Saudi Arabia couldn't even understand someone in Morocco. Hermes Frangoudis (43:45) Yeah, because it's just completely different dialects of Arabic. Lily Clifford (43:47) It is like Spanish and Italian. Hermes Frangoudis (43:49) It's wild. You guys, obviously, with Rime, power a lot of real-time solutions for enterprise businesses, major brands. I don't know if you can name drop some, like what are some of the things maybe you learned from these deployments that maybe surprised you, similar to that fact about Atlanta? Hermes Frangoudis (43:50) So what motivated Deepgram to build a speech-to-text engine, kind of from the ground up using that end-to-end deep learning? Andrew Seagraves (43:58) Well, that's a great question. That takes us back to the early days of the company when it was founded and they chose to focus on that early on. That is not actually the first problem that they tackled. So the founders were originally dark matter physicists and they were doing a lot with audio. They were using audio signals and shooting them into the earth and then measuring what came back. And then trying to use machine learning to understand whether or not there was dark matter present. And so they had some very strong expertise that they were building in applying deep learning in their particular application. This is back in the 2015 timeframe. They were also doing weird stuff like recording themselves. Hermes Frangoudis (44:28) That's Andrew Seagraves (44:45) for weeks, like all of their audio, wherever they were, they attached mics to their clothing and they were recording themselves. And so they amassed this very large volume of audio from recording their everyday lives.
And they were like looking at their machine learning stuff that they were doing, looking at this thing, this large corpus that they would never be able to listen to, to like find the interesting tidbits. And they decided to try and tackle the problem of like using the machine learning to search the audio, just as a side project. And that was how Deepgram started. They went and founded a company. They built a deep search algorithm. In doing so, they actually indexed all of YouTube at the time. And you could find random audio clips in a YouTube-scale corpus. They demoed on stage at GTC with Jensen. This was very early days. But they realized at that time that search was not a hot thing, and there wasn't a big market for it. Hermes Frangoudis (45:34) Okay. Andrew Seagraves (45:42) So speech recognition at that time was an emerging green field and there were very few players and all the models were terrible. And so they had this strong conviction from the beginning that if you combined a system that's learned end-to-end, a single network and lots and lots of high-quality data that you could build a model that could transcribe sort of like potentially any human in any situation. So it was like, it was just one of the early convictions that they had. They went about building an early prototype, it's like 2016 timeframe. Hermes Frangoudis (46:19) That's super interesting. Gotta love that gap in the market, right? Like everything else is terrible and you're like, actually, I think if we do this, it could solve this solution, right? Andrew Seagraves (46:31) Yep. They also kind of got lucky in that there were some early AI adopters from call centers, basically like AI platforms that had many call centers as customers. They had these huge volumes of call center audio that they wanted to transcribe and do analytics on. And that particular domain like is narrow enough that the very early deep learning models that we had actually worked. 
And you could train on these narrow domains and produce models that were like 80, 90% accurate. Hermes Frangoudis (46:57) Okay. Andrew Seagraves (47:04) If you just specialized the data. And that was like some of early magic of Deepgram models, that they worked for particular applications where there was an interest and a lot of data and people who were like willing to try to use the models. So that was one of the reasons that we have built for scale early on too. Hermes Frangoudis (47:05) So what are some of the trade-offs maybe between like more expressive tones or speech patterns versus like recognizability when you're designing some of this stuff? Is it purely just like, 'Hey, we're going to have these different flavors and we're just going to annotate, give them different annotations of data,' like the richness and it all kind of comes out the way it does or is there something you kind of do in that sauce too? Lily Clifford (47:27) There's still a lot of missing pieces, I would say, in like having voice models that are both highly expressive and also controllable. Like there's still such a trade-off there today. And it doesn't have to be that way. I would just say like, again, people are focusing on different things at different points in history. And we're at a moment right now where people have sacrificed controllability for expressibility or for expressiveness. And again, it's not going to remain that way for very long. Like Rime will train, you know, the most expressive speech synthesis models today are trained on top of large-language models that were, that have only seen text really. It's strange, but true. Like you can take a large-language model that saw 25 trillion tokens of text, just text, right? And then you start showing it text and audio and it learns text and audio. It's crazy, but true. Hermes Frangoudis (48:03) Interesting. So it becomes kind of multimodal. Lily Clifford (48:19) It is multi, yes, exactly. 
Which by the way, like no one would have ever described a text-to-speech model before as multimodal. Like you predict audio given text, so in that way, text-to-speech is like the OG multimodal. It's never not been multimodal, is what I'm saying, but to your point, yes, definitionally, in the way that people talk about large language models, it becomes multimodal. Yes. Hermes Frangoudis (48:38) Yes, okay. Lily Clifford (48:39) The benefit of that is it saw 25 trillion tokens of language. So it basically understands something about language. Hermes Frangoudis (48:45) It's able to create those little nuanced patterns in the written language, which can then somehow help reinforce it in the audio pattern. Lily Clifford (48:51) Correct. Crazy but true. And at the same time, like those 25 trillion tokens of text data are not phonemic or phonetic representations, right? They're just how you and I would spell words. And so like, there's still work that needs to be done to like post-train the text-based LLM, still on text, or phonemic text. And so what you're seeing now, and essentially this is frontier research, is like how to take this Hermes Frangoudis (49:06) Not at all. Yeah. Yeah. Lily Clifford (49:23) LLM backbone, and then post-train it to learn phonetic representations in concert with these orthographic representations, the written forms of language, and then post-train it again to become multimodal. And so that's really the cutting edge. And that's how you're getting a high level of nuance and richness in the spoken language while also getting the controllability that comes with the phonetic representation. So that's the future. Hermes Frangoudis (49:47) That's amazing, and that's like the bleeding edge right now, like that's the frontier that is really pushing the space forward. Lily Clifford (49:53) Yeah, it's like Google and Rime basically that are doing that.