Klemen Simonic (00:00) We basically kind of restarted the team there and built the core of the speech AI system at the time. … If you want to have a voice-driven application that works naturally, you have to understand almost every word. … Most languages around the world with some significant population are covered by our model. … Imagine that you have a person, a human, that speaks 60 languages fluently, with native accuracy. This is basically our AI model. Hermes Frangoudis (00:07) Hey everyone, welcome to the Convo AI World Podcast, where we interview the founders and leaders pushing the voice AI space forward. Today, I'm very excited to have Klemen, founder and CEO of Soniox, with us to talk about not only speech-to-text systems, but what it takes to reach native-speaker accuracy in speech-to-text. Thank you, Klemen, so much for joining us today. Klemen Simonic (00:30) Hermes, thank you for having me. Happy to talk to you. Hermes Frangoudis (00:32) Yeah, of course. Congratulations on your recent first-place finish in the evals that were just released. I want to give a huge congrats to your team. Before we get too deep into speech-to-text, I'd love for our audience to learn a little bit more about Klemen Simonic (00:43) Thank you, thank you, yeah. Hermes Frangoudis (00:51) the origin story of Soniox and what originally pulled you into working with speech and audio and AI. Klemen Simonic (00:59) Well, I started working with machine learning in 2008, when I was a first-year undergraduate. I worked with a team at a research institute in Slovenia, and the team was called AI Lab back then, when AI was not a buzzword yet. And I happened to work on really cutting-edge natural language processing from then on: support vector machines, all kinds of clustering, knowledge graphs infused into structured prediction machine learning problems. So I got exposed to machine learning on language and sequences from 2008, really. And I've always been working on problems of language understanding, but speech really came into play when I joined Facebook in 2015, where I started working on speech with a few others. We basically restarted the team there and built the core of the speech AI system at the time. That's where all of this background in natural language processing came into play for speech. And of course, I learned many new things about speech and audio. The speech and audio world is much different from text-only, and in so many ways harder. I was at Facebook for five years and built many systems that went to production and served hundreds of millions of users. Live captioning was part of the Facebook product, Meta now, and all kinds of other things were shipped there. So then in 2020, we started Soniox. Hermes Frangoudis (02:46) That's such a deep history. In 2008, that was the early diffusion of AI; neural nets and these things were just emerging as the cutting-edge technology. And you got to take the position of being in it from the audio side instead of the text world. Super cool. Klemen Simonic (02:56) Yeah. Hermes Frangoudis (03:05) And then going on to Meta and Facebook and getting to deploy to millions of people in production. So you really understood what it takes to scale this kind of technology, not only in research but into production.
What problem really stuck out to you in terms of speech recognition or translation that you felt wasn't really being solved at the time? What gave Soniox that early opportunity to come together? Klemen Simonic (03:30) Yeah, I mean, the key thing about speech-to-text, speech recognition, is accuracy. And it sounds simple, but it's actually quite complicated. And when we say accuracy, I don't mean accuracy just for English, okay? And just for clean speech environments where you have really no noise and super high-quality audio. When I say accuracy, I mean speech AI that's built for the real world. And not just meant for English, but meant for all languages, or at least for basically eight billion people around the world. Which means that you need to address about 60 languages, and you have to achieve extremely high accuracy for all 60. And the problem that comes with achieving high accuracy is: how do you recognize a minority language with accuracy as high as the much more popular languages, while having far less labeled data, or almost no labeled data? What really struck me at Facebook, because we were so early on in speech recognition, we built 10, 20 speech recognition systems there for different languages, was the amount of human-labeled transcription data that you need to train a system. And back then, this was 2016, 2017, it was clear that the more data you had, the better the speech AI system was. It was clear that if you had, for example, 15,000 hours for English, English was so much better than another language where you had just 500 hours. So the key question we stumbled on was: how can you level this, equalize across all languages? To give not just English but basically 60 other languages an equal opportunity to use voice and speech in their applications, and still get really, really high accuracy. So really the key question is multilinguality, and the accuracy that comes with multilinguality. How do you solve this problem? It was clear to me that the human labeling that's been happening, and still is happening, is not really a way to get to this native-speaker accuracy, which is required. If you want to have a voice-driven application that works naturally, you have to understand almost every word. You can't have a word error rate of, I don't know, 10%, 20%, where every fifth word or every tenth word is misunderstood. You wouldn't really use such an application. So basically, at the very start of Soniox, the key question was how to use unsupervised learning, so-called self-supervised learning now, on insane amounts of internet and other available data to pre-train models in a way that they can understand a lot of Danish, Finnish, Arabic, and be able to recognize these languages very accurately despite the fact that there is very little human-labeled data for them. And we really pushed hard on this from the very beginning. This is what gave us an edge. And that's why today we have really native-speaker accuracy for 60 languages. You can go to Japan and speak in Japanese, in Korean, Taiwanese, English obviously, Spanish, Portuguese, German, Italian. Really, most languages around the world with some significant population are covered by our model, and it recognizes them with native-speaker accuracy.
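To make the self-supervised idea above concrete, here is a minimal, loosely wav2vec-style sketch in PyTorch: mask spans of an unlabeled spectrogram and train the network to reconstruct them, so no human transcripts are involved at any point. The model sizes, the masking rate, and the reconstruction loss are illustrative assumptions, not Soniox's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAudioPretrainer(nn.Module):
    """Self-supervised pre-training sketch: reconstruct hidden audio frames."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_mels)   # predict the masked frames back

    def forward(self, mels, mask):
        # Zero out masked time steps so the encoder must infer them from context.
        x = self.proj(mels).masked_fill(mask.unsqueeze(-1), 0.0)
        pred = self.head(self.encoder(x))
        # Score the model only on the frames it could not see.
        return F.mse_loss(pred[mask], mels[mask])

model = MaskedAudioPretrainer()
mels = torch.randn(8, 200, 80)        # a batch of log-mel spectrograms
mask = torch.rand(8, 200) < 0.15      # hide ~15% of the time steps
loss = model(mels, mask)
loss.backward()                       # unlabeled audio in any language can drive this
```

The point of the pre-train-then-fine-tune split is that this step needs only raw audio, which exists in abundance for Danish or Finnish, while the scarce human-labeled pairs are needed only for a much smaller fine-tuning step.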
Hermes Frangoudis (07:04) Super interesting. And you touched upon something there that I want to dig into a little bit. You mentioned unsupervised learning and labeled data. One of the things that's been a constant theme, I would say, across all these podcasts, when you talk to leaders in the space, is that everything comes down to the data, the labeling, how it's connected. So at a high level, how does Soniox approach this in a single speech model across so many languages? Klemen Simonic (07:32) I Hermes Frangoudis (07:34) No secret Klemen Simonic (07:35) Nothing. Hermes Frangoudis (07:35) No secret sauce. Just saying, at a high level, you mentioned you put together unsupervised learning systems, you found ways around this, right? Klemen Simonic (07:41) Yeah, yeah. So internally we call it the AI data factory. We are basically utilizing the most advanced AI models, from speech to LLMs to language identification, to gather and create with AI large, high-quality training datasets on which we can then train our system. Think of it like what's been happening with supervised fine-tuning and reinforcement learning for LLMs; we've been doing that for three, four years now. So we started with that, and we really pushed this process to very, very large scale. We're operating on many petabytes of audio data, many tens of millions of hours. So the question is: how do you use state-of-the-art AI, in various ways, to create and select great data to train a better AI? That's the magic circle, I would say. And how do you do this? Because speech training is much different from text training. So let's maybe dive in here just for a second. When you train a text LLM, what is the input to the text LLM? It's text. And what is the output? It's also text. And typically, at least in the pre-training stage, these two texts are actually the same. Input text and output text are the same, just shifted by a token. The model predicts, based on N tokens, the next token. So pre-training, and collecting data for pre-training, is in so many ways much, much simpler, because you don't need to link two modalities. It's actually the same text; you just shift it to the right by one token, or one word, whatever you want to call it. But when you go to speech and voice, you have two modalities. You have an audio signal as input, a raw audio signal. And what is the problem with speech-to-text? Well, you have to actually predict text: tokens, words. You don't predict speech; it's a different modality. And if you go online, you will not find some beautifully curated datasets for, let's pick Finnish, with tons of real-world Finnish speech mapped to text, right? You will not have that. So the key is how you actually create such datasets, where you map speech in basically any language to text and successfully train on them. Hermes Frangoudis (10:24) Super interesting. The dual modality is definitely the big difference and the bigger challenge, right? It's really... Klemen Simonic (10:28) Mm-hmm. Yeah, and it's really the hard part. It's really, really hard.
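The asymmetry Klemen describes fits in a few lines of Python. The token IDs, the Finnish file name, and the transcript below are hypothetical placeholders; the point is only the shape of the data.

```python
# Text LLM pre-training is self-supervised: the target is the input shifted
# by one token, so raw text alone is a complete training example.
tokens = [101, 7592, 2088, 2003, 2307, 102]   # any tokenized sentence
lm_input = tokens[:-1]                        # the model sees these...
lm_target = tokens[1:]                        # ...and predicts these

# Speech-to-text gets no such free lunch: the input is audio and the target
# is text, and the two modalities have to be explicitly paired. Building
# these pairs at scale, for 60 languages, is the data-factory problem.
asr_example = {
    "input": "finnish_utterance_0001.wav",    # hypothetical audio clip
    "target": "hyvää huomenta kaikille",      # hypothetical transcript
}

print(list(zip(lm_input, lm_target)))         # (context token, next token) pairs
```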
Hermes Frangoudis (10:34) When you do this dual modality, there's, I guess, multiple approaches, right? You could train a model for every language versus a universal model. And your approach is more the universal single model that understands everything, it sounds like. So why this choice versus multiple models? Klemen Simonic (10:52) Yeah, that's a great question. So we used to train and have individual models, like an English-only model. And then when we first did other languages, we did them in bilingual mode. Soniox was actually the first to offer a bilingual speech-to-text system. So when we said we had a Korean speech-to-text system, we really had Korean plus English. Or when we said Italian, we had Italian plus English. And then we saw how the model was able to leverage one language to improve on the other. It's kind of logical; it seems obvious now. The concept of transfer learning from one language to another happens in speech-to-text and speech AI as well. So then we said, we're going to create this new generation of speech-to-text AI model, which is where we're seeing big traction around the world: we're going to grab all the languages, the 60+ languages that we want to support, put them into one model, and train on all of the data for all languages at once. And then, as I mentioned, the sounds of some languages are very similar, and how we write words across languages is similar, so the model can leverage a lot of that. Take entities, which are typically a problem for most speech-to-text AI systems out there. Take, let's say, a French speech-to-text AI system that's not good with well-known entities in, say, America, just because it was never really trained on French corpora that use lots of entities, locations, people's names. It might struggle with, I don't know, Elon Musk, San Francisco, New York, et cetera, right? And it does, actually, many times. But when you train them together, it all naturally merges. You can go to a language where you never saw a single occurrence of Elon Musk in the training data paired with that language, yet the model easily recognizes the entity because it learned it from, let's say, English. Hermes Frangoudis (13:03) That's super interesting. Klemen Simonic (13:04) And this problem of multilinguality is actually even more severe. People think, why do we need, let's say, Austrian named entities? Well, just think about wanting to say: I'm going to go to Vienna. Graz. Linz. These are cities in Austria. And it's legit, perfectly legit, to say this in English. In fact, you want to say this to Google Maps, but it's not really going to work, right? So with multilinguality it seems like, okay, you don't really need it, but when you actually use speech in the real world, you need it all the time. You may want to speak in English, but you will say apple strudel, right? We all know what it means, but do you consider this English? I consider this English. It's two words maybe borrowed a little from the German language, but they fit totally in this context. So multilinguality really cuts across languages. A given language is a mix of many popular and other terms, entities, and even many technical terms from other languages. Hermes Frangoudis (14:18) That's super interesting. So the AI does its thing, right?
It recognizes patterns, it recognizes similarities, and that's super interesting because all language, I guess, has similar roots, right? There were different languages that spun off and created new languages. So it's finding those similarities and it's able to learn from that, right? Klemen Simonic (14:39) Yeah, exactly. It's kind of hard for us to comprehend this, because none of us, I guess, speaks more than a dozen languages. I think that's pretty fair to say; most of us speak maybe two, three languages fluently. So imagine that you have a person, a human, that speaks 60 languages fluently, with native accuracy. This is basically our AI model. What's happening inside is hard to say precisely, but the space of languages on the voice and speech side, and also the output space, becomes very continuous. It's not so discrete: this is English, this is Spanish, this is French. They all end up on a spectrum, nicely connected in some ways. They have large amounts of overlap. Hermes Frangoudis (15:26) Yeah, it blends it together. That totally makes sense. Even as a Greek speaker, you can hear certain words that native American or English speakers don't pronounce the right way, while native speakers will pronounce them the right way. So capturing that in a dataset, super cool. What would you say some of the biggest challenges are when you're talking about transcription and translation Klemen Simonic (15:43) Exactly. Hermes Frangoudis (15:51) in real time? Klemen Simonic (15:52) So there's a difference between transcription and translation; maybe we can separate that out. And there's also the problem of offline, batch, versus real time. These are kind of two completely different things; we can take them separately, if that makes sense. So maybe let's start with offline versus real time. Hermes Frangoudis (15:54) Yes, yeah. Klemen Simonic (16:13) Streaming, low latency, is the real-time part. This is another thing that Soniox has been the best at from very early on, from when we basically started the company. We immediately started working on real-time speech-to-text, which really means that you are streaming the audio into the neural network, and with very low latency, very quickly, very fast, you have to generate words, parts of words. And while the audio keeps coming in, you have to keep generating words. This is much different from what, for example, ChatGPT or other text LLMs do. With those, you provide all the context. You ask, 'hey, how are you?'. It has the entire thing, and then it generates an entire response. So you don't have this concept of bidirectional streaming, okay? You provide the entire input and the entire output is generated, and then you again provide the entire input and the output is generated. Streaming with low latency is really much, much more difficult, because you have to make decisions in a low-latency setting, in real time. A speech-to-text AI system is, in some ways, as I sometimes say to the team, like a robot. It's basically a robot. It gets in a stream of raw audio, and it needs to make decisions in real time. It doesn't steer a car, but it does produce tokens, it does produce words. And how do you know this robot, this speech-to-text AI system, is good? It's good if it's producing the right words with low latency, as soon as possible.
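The "robot" loop Klemen describes can be sketched as a generator pipeline: audio chunks flow in while words flow out, rather than one request producing one response. The chunk sizes and the fake one-word-per-300-ms recognizer below are stand-ins for a real incremental model, not any actual API.

```python
import time

def microphone(seconds=2.0, chunk_ms=100):
    """Pretend audio source: yields 100 ms chunks of 16 kHz, 16-bit audio."""
    for _ in range(int(seconds * 1000 / chunk_ms)):
        time.sleep(chunk_ms / 1000)
        yield b"\x00" * 3200              # 100 ms of silence as a placeholder

def streaming_stt(chunks, context_bytes=9600):
    """Consume audio incrementally; emit words while audio is still arriving."""
    buffered = b""
    for chunk in chunks:
        buffered += chunk
        # A real model would run incremental inference on each chunk; this
        # fake emits one word per ~300 ms of audio to show the loop's shape.
        if len(buffered) >= context_bytes:
            buffered = b""
            yield "word"

# Unlike a request/response LLM call, output overlaps with input in time.
for word in streaming_stt(microphone()):
    print(word, end=" ", flush=True)
print()
```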
So the streaming problem, the real-time aspect of this, is a lot more challenging to do well, especially with these modern architectures, with transformers and LLM-like architectures. It is really difficult to do this accurately and well. We had to do a bunch of innovations here just to train transformers to become streaming transformers, effectively. So that's the offline part versus the real time. Now, transcription and translation are really two quite different problems. In our speech-to-text AI model, our speech-to-text API, we don't offer just real-time transcription; we also offer real-time speech translation. And we saw a big opportunity here to improve upon everything that has been done so far. If you use Google Translate or other real-time speech translation services, you will see they basically translate with a relatively large delay. You first see a transcription, then it waits until the end of the sentence, and only then does the translation occur. This really breaks the conversation flow, because you wait, I don't know, 10, 15 seconds before you get anything translated. So we said, we can probably do translation much, much faster. We have to somehow do it mid-sentence. Think of it in chunks: as I'm talking, I can also do the translation with a little bit of delay, maybe a one-second, two-second delay. So we trained the model in exactly that way. The model has been trained to do translation as the person speaks. You'll get a transcription and you'll immediately get a translation, without waiting for the end. You say a few words, then come the translated words; a few more words, then translated words. And then it's really interactive, for the first time. You can just fluently talk to another person in another language, back and forth. There's basically no delay on our side. Hermes Frangoudis (20:01) No, that's huge. So taking the real time, we'll leave batch to the side, you're making almost two predictions, right? First is the prediction of what's the next token, what's the next word. But then you have to reevaluate as more context comes in, because different languages obviously move the words around a bit. So it's doing Klemen Simonic (20:06) Yeah. Hermes Frangoudis (20:24) two predictions then, as it does the translation? Klemen Simonic (20:26) Yeah, kind of. This is exactly what's happening, like you say, Hermes. Because it's a streaming model, a real-time model, as it gets more context, more input, more audio, like when I'm saying 'beautiful', right? As I'm giving more of the signal, more of the input, the model gets better at recognizing that word, effectively. And when I have fully finished speaking the word, it will predict that word much, much more accurately than mid-word. This holds for active speech; that's how transcription is done. And then translation is done after the transcription of a few words is completed and assessed with high accuracy. The model does this on its own; it's been trained like this. Now this transcription is very accurate, and it's the right time to do translation, because it's not always the right time to do translation, right? You cannot do translation every three, four, five words.
For some languages like German or Korean, you would completely break the meaning of the translation, because the words get completely shuffled when you translate. So basically, the model has been trained to listen and wait until it can actually produce a high-quality translation. That's built into the model. And that's how you get real-time transcription with high accuracy, as well as really high-quality translation. Hermes Frangoudis (21:49) No, super interesting. And that's what gives you that trade-off, right, between the latency and the accuracy: how do you time it perfectly so that it doesn't lose meaning? Klemen Simonic (21:59) Exactly. And that goes into the problem of real time and tying the speech with the text, et cetera. Hermes Frangoudis (22:08) That's huge. Because this is a multilingual model, and multilingual people don't always speak in one language, right? How does Soniox handle this language switching? Because when I speak to my parents, I'll speak in half Greek, half English, whichever word is shorter or simpler in meaning, right? Klemen Simonic (22:26) Yeah, people switch languages mid-sentence, quite often. Especially if your first language is not English, you switch languages a lot, at least between two or maybe three languages. This happens all the time. You just mentioned Greek plus English, or English plus Greek, however you want to order it. This is especially known for Spanish, right? Spanglish. Or Hinglish, Hindi plus English. And really, you've seen what happened with K-pop, with Korean culture in the last decade or so; it's become really mixed with English. So, because we have one model that recognizes all 60 languages, we have the ability to recognize these language switches at any point. The model handles them automatically. We can speak in English and at any time just throw in a few Greek words, then continue in English, maybe go to Italian, then French, maybe to Slovene, then to English, maybe Korean, Chinese. You can mix them, and the model just follows you. So this is maybe one way to think about why this universal, single AI model is so important: it really takes away the assumptions of the legacy, traditional speech-to-text AI, speech AI models. And the more assumptions you take away, and the closer you get to real-world situations and use cases, to what happens when people actually speak, the more useful the technology becomes. We've seen this with ChatGPT. ChatGPT doesn't say, 'I'm going to type now in English, and you can type only in English. You're restricted.' No, you can just type whatever you want. And you don't have just one specific task to solve; you can solve any task with these kinds of tools. It's very generic. It takes away these limitations and constraints; it's just very universal. And basically, the same thing happens here. To tie it back to language switching, it takes away the assumption that, okay, I'm going to speak only and strictly in this language. In fact, no, we can just speak in multiple languages, and the model will follow you wherever you are.
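Pulling together the translation discussion above: the key move is translating per stable chunk instead of per finished sentence. Here is a sketch of that control flow, where is_safe_boundary() and translate() are hypothetical stand-ins; a real model learns the boundary decision, which is exactly what keeps German or Korean word order from being scrambled.

```python
def is_safe_boundary(words):
    """Decide whether the pending words can be translated without reordering
    damage. Real systems learn this; here we naively cut every five words."""
    return len(words) >= 5

def translate(segment, target_lang):
    return f"<{target_lang}: {' '.join(segment)}>"   # placeholder translation

def incremental_translation(transcript_words, target_lang="de"):
    pending = []
    for word in transcript_words:        # words arrive from the streaming STT
        pending.append(word)
        if is_safe_boundary(pending):    # translate mid-sentence, in chunks
            yield translate(pending, target_lang)
            pending = []
    if pending:                          # flush whatever remains at the end
        yield translate(pending, target_lang)

words = "you say a little bit of words and then the translated words appear".split()
for chunk in incremental_translation(words):
    print(chunk)    # arrives a second or two behind the speaker, not 10 to 15
```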
Hermes Frangoudis (24:53) That's huge, because that's what really makes it feel native-speaker, right? You mentioned use cases, and that's, I think, a great segue into the next subject: what use cases or industries are really getting the most value from Soniox, and where are you seeing the most interest? Klemen Simonic (24:56) Yeah. Yeah. There are so many, Hermes. So yeah, I mean, speech and words, it's really quite a universal technology when you do it right, when you nail it. And I can now confidently say we did something really amazing with our universal V3 model, and now with our release of V4, because we see the adoption and usage all around the world. So we're super excited about this. And the use cases pop up; even new use cases pop up, I would say, maybe every week or every other week or so. But what would be the biggest use cases? One is for sure voice agents. Voice agents are now kind of taking off; that's one big use case, and we can talk more about that. The other big use case is wearables: smart glasses, pins, watches. They are also taking off. There are quite a few companies now that make very compelling products, smart glasses that you can wear basically all day long on one battery, and they need a speech-voice interface. And we see many of them being powered by Soniox. Then another big use case is healthcare. Many, many customers, not just in the US, we now see it in Europe and in Asia too, are leveraging this high, high accuracy for healthcare applications. So instead of the doctor typing, they can dictate, type with their voice. That's one use case. And the other use case is ambient healthcare AI, where the doctor and patient talk to each other and the AI listens in the background. And there you really need accurate speech understanding. Yeah, and there's more. There's more. Let me stop here. Hermes Frangoudis (27:06) Yeah, no, that's huge too. It makes sense. Agents are just exploding left and right. The ability to speak to the AI in your most natural form is exploding. And then healthcare, I can totally see that. There are so many instances where there needs to be a transcription, a record: how did it happen, what happened, what are the notes? Klemen Simonic (27:11) Mm-hmm. Hermes Frangoudis (27:30) It's huge if that helps there, because there's so much time lost to just writing it all up. Klemen Simonic (27:34) Yeah, it really boosts the productivity of the doctors, of the staff, the medical staff, and speech is at the very core of this optimization process. Hermes Frangoudis (27:44) So, since your models are multilingual, how does this real-time translation change the experience for multilingual teams or audiences that are not native to the speaker's language? Klemen Simonic (28:01) Yeah, I mean, with our real-time translation, we really broke this language barrier. Now you can have a conference where you have attendees from many different language backgrounds who might not be so fluent in, let's say, the language that the speaker speaks. It could even be English. People from all corners of the world are not necessarily fluent in or really comfortable precisely understanding the speaker, even in a popular language like English. And we now see some conferences being translated into many different languages with Soniox. And this is live, completely live: as the person speaks, as they give a presentation, and it could be an event, a conference, whatever.
There will be TVs or other kinds of projections showing multiple translations in parallel. So that's something that we've seen. Another thing that we've seen: we've built a mobile app, Soniox, and people often just use it at some event or with family. Multilingual families, families with multilingual members, that's another big use case. Yeah, maybe I should mention this. I didn't know about this; I guess I wasn't exposed to how prevalent it is, but it's actually huge. Partners might get married across two different families with two different languages, and there is no communication between the families because they don't necessarily speak a common language. And now we've seen so many examples where they use the Soniox app, the mobile app, and they just converse, because bidirectional real-time translation is right there, one language to another, et cetera. This is really the experience of breaking the language barrier. And it's dear to me, because my sister has married into a French family, and she's using it and has had a great experience. Hermes Frangoudis (30:04) No, that's huge. We'll have to use that next time I take my family to visit Greece. Klemen Simonic (30:10) Yeah, you should. Let me know. Hermes Frangoudis (30:12) Yeah, I've been using the Soniox app. It's amazing. I've had multiple times where I travel for work, right? You get into a Lyft, and the driver doesn't always speak the language. Klemen Simonic (30:12) Let me know. I'd love to know. Yeah. Hermes Frangoudis (30:23) And sometimes it's a long ride to be sitting there in silence. And I find myself pulling out the Soniox app and using it to be able to communicate and speak. Like you said, it breaks down that barrier in a way that, five, six years ago, I just couldn't see being broken down. Klemen Simonic (30:38) Yeah, yeah, see, another useful case. Thanks. Hermes Frangoudis (30:41) Yeah, of course. Speaking of unexpected and creative ways, is there any other way you're seeing customers, or just people, use Soniox that maybe you didn't even imagine? Klemen Simonic (30:53) Yeah, I've been struck by how important this is for people with hearing disabilities and in the deaf community. We received, and continue to receive, feedback on just how transformative it is to have AI software that can listen carefully to every word being said so that they can read it, because otherwise they just don't hear it, or they hear just a portion of it and are not really able to interpret it. One piece of feedback we got was from a user who, when I asked him, 'can you describe how it changes your life?', really vividly described that now his wife doesn't need to stay at home anymore when he goes out of the apartment, because he can use the Soniox app. He can go to the doctor by himself, because now he can communicate. I think this is really profound. So I'm super proud of the work that we're doing and how it's projected into the everyday lives of people, how they can use it. It's really hard to build extremely useful applications and tools for the world. And there are a lot of people with hearing disabilities; this is not a small community.
That's another thing that surprised me when I learned it: about 10% of the population has hearing disabilities, even severe hearing disabilities. So there are lots of opportunities to change people's lives here with our app. Hermes Frangoudis (32:28) That's amazing. The transformative power of AI to make the world and life more accessible to people, to help them out in that daily struggle, right? Because they were not experiencing the world the same way other people were, but now they can. That's huge. Klemen Simonic (32:42) Yeah. Hermes Frangoudis (32:45) So earlier in our conversation, you mentioned apps like Google and their translation app. And I've ditched that to go over to Soniox for my translation. But how do you see the competitive landscape? How do you compare yourselves? I know Google is kind of Google, but you guys are disrupting in this area. How do you think of Soniox's position and where it's really going to go, competing with these larger providers? Klemen Simonic (33:17) From my experience so far, developers and companies that are serious about building voice- and speech-powered applications choose the best provider. Actually, they really want to choose the best provider, because otherwise it just doesn't work. If you choose, let's say, Google, then for most of the languages it will just not be there compared to Soniox. You will not be able to build that application; you will not be able to go and serve the market or build the product that you want to build. So there's the accuracy, okay? As I said, as simple as it sounds, it's actually really hard to achieve. And there's a lot more to do here; I don't think we're done just yet. And I would say we probably have the best, not one of the best, if not the best speech AI team in the world. So we have more to build here. And it's not just speech-to-text; there are other problems to solve related to voice, speech, and voice agent experiences. This market, as I said, really just started. Think about touchscreens, right? We used to have keyboards before touchscreens came out, and now touchscreens are everywhere: in the car, in kitchen appliances, obviously the phone, the laptop, et cetera. The same is going to happen with voice. I think voice will be ubiquitous; it will be really everywhere. And there are a lot of opportunities here to build a big company. The key is to focus and solve the right problems. And these are really hard problems to solve, where a small, focused team can out-compete anyone else. So I'm pretty confident about the future of our core AI technology, our advantage, and where we can take the company. Hermes Frangoudis (35:25) That's huge. And you're right, a smaller team is always much more nimble. You can adapt, right? And as a team, you also need to make choices and prioritize development. So for your team, is it driven by customer demand? Is it driven by foresight into where the market is going? Is it a mix of both? How do you make that prioritization on where to focus? Klemen Simonic (35:51) Yeah, this is really hard, Hermes. Hermes Frangoudis (35:53) Yeah. And we don't have to answer it if you don't want to go into that one, because that one is kind of a tricky balance, right? Klemen Simonic (36:02) Yeah, I mean, I don't think I have a really good answer. It's actually a hybrid.
It's both, okay? And it's a lot about the intuition and the future, the vision that you have for the product, for speech-voice interactions particularly. Someone who has spent a lot of time on this might have a better intuition of what is the right thing to do at this time; it's very easy to do the wrong thing at this time. So in choosing the right thing to focus on, a lot of it is intuition-based, I have to say. And by intuition, I mean expertise from experience over the years; that's what I mean by intuition. Because sometimes the feedback that you get from users and customers is not really the right direction, and they don't even know it. They're used to using speech and voice in a very particular way, but you already see that in a year it's actually going to be elsewhere. They just don't know it yet. So in those cases, you just have to go and build for the vision, develop for the vision. But often customers are right. For example, if the speech AI doesn't work reliably in some cases, you do want to improve that really, really quickly, as soon as possible. So both, both. Take the customer feedback, but not always; ignore it sometimes and follow the intuition. Hermes Frangoudis (37:29) No, that totally makes sense. This is really where the customer tells you what they're expecting, what they're looking for, and it's for you as a founder to understand what's the bigger problem they're trying to solve. So when we think about the bigger problems and where this space is going, how do you see speech recognition really evolving over the next year or two? Klemen Simonic (37:53) I wouldn't say speech recognition necessarily, but this speech-to-speech experience, speech in, speech out, STT, TTS, and some brain in the middle, is really going to get to a point, I think, where it's very reliable and robust, and you will be able to converse with an AI like with a human. I think this will happen in, let's say, one year's time. How much of a task it will be able to do, that's a separate thing, but at least it will be able to have a very, very human-like interaction and communication for some period of time, let's say, I don't know, five, 10, 15, 20 minutes, one session. It will not span days or weeks; I think there's more to do there. But this will be the key thing: conversing with an AI just like you would with a human. Hermes Frangoudis (38:48) Really blurring that line between where it is now and where it's going. Yeah, that makes sense. Earlier you mentioned unsupervised learning. Do you see that really continuing to push the envelope forward, along with better data and producing better data in that sense? Klemen Simonic (38:50) Mm-hmm. Yeah, look, just like we see with text LLMs, reinforcement learning is kind of taking on a life of its own in terms of how you do training and improve the models. The same kind of feedback, the AI data factory process, happens with these multimodal systems where you have speech and text. So yeah, this is going to drive the reliability, robustness, and accuracy further, a little bit every time. This will be the key for better systems, at least by the end of this year. Hermes Frangoudis (39:47) That makes sense. And do you think we'll see the models become small enough to shift from the cloud more to the edge?
Or do you think it'll continue to be a bit more of a hybrid? Klemen Simonic (40:00) I mean, you can already deploy models on mobile devices today, right? We've been doing some work in this space, and we have fairly accurate on-device models. They are less capable than what's in the cloud, just because you're compute-constrained. And if that suits the application you're going after, then on-device deployment, an on-device model, will work. But I think once you're used to this insane, beautiful experience of voice and speech running in the cloud, it's going to be very hard to achieve the same experience on-device. Very hard, just because of the compute constraints, as I said. But there will be applications; I'm sure there will be applications. And the technology will also advance, I think, in terms of compression and various kinds of things, so that you will have very, very accurate models, multilingual models, also running on-device. Yeah, probably so. Hermes Frangoudis (40:55) No, that makes sense. And you mentioned the amazingness of the cloud models. So in terms of these cloud models, one of the biggest troubles is really overlapping speech and noisy real-world audio. Where do you see the future of solving that problem? Klemen Simonic (41:13) I think we've solved most of the overlapping speech problem at Soniox, because we've trained the models to basically linearize the overlapping speech. So if you were speaking and I were to interrupt you a little, okay, say one or two words, we would basically just linearize it, transcribe it in order, and then you would continue. So I consider that problem to be pretty much solved. What's more tricky, perhaps, is knowing precisely who spoke. So speaker separation and identification, robustly solved, that hasn't been solved yet. And it will be a really important problem to solve for, let's say, wearables. For example, if you're wearing glasses, it's really important to know who's wearing the glasses. If I were to converse with others while wearing smart glasses, it's important for the glasses to know when I'm speaking and when others are speaking, for example; this is a simple case. So there's more to do on speaker awareness, speaker understanding, who is actually speaking. I think there's still space to do that. Hermes Frangoudis (42:21) That makes sense. And there are definitely some improvements, right, with voice prints and things like that, where you can have one person identified. But then, like you said, if I'm wearing glasses, I'm speaking with multiple people. So over the course of the day, understanding multiple different voices and being able to keep that straight definitely sounds like a huge challenge, but a huge opportunity in the same sense. Klemen Simonic (42:34) Yeah. Yeah, absolutely. Hermes Frangoudis (42:46) You touched a bit on how agents are really exploding, and this combination of speech-to-text and text-to-speech with the brain in the middle. Where do you really see that future going? Do you see them continuing to be cascading models that outperform all-in-one, or do you slowly start to see those become completely multimodal? Klemen Simonic (43:08) Yeah, that's also a good question. So let me answer like this. I believe that robustness, or reliability, of the voice-speech experience is actually the most important thing. And we hear this over and over from our customers.
So the reason so many companies are switching over to Soniox systems and so on is because we are really robust and reliable. For example, alphanumerics: emails, numbers, phone numbers. You cannot get a number, one single digit, wrong; it's just game over. So when we talk about accuracy, one single digit is everything in one conversation in a customer experience. So robustness and reliability are a must-have to get to this automation process with voice agents. I think that for a while we will still see the cascading approach of three components, speech-to-text, the LLM, and TTS, because nailing each of these components to 99.9% reliability is really essential. So I predict that we'll still see the three components for quite some time, until we're really able to train end-to-end with this really highly, highly reliable, accurate product experience, which will definitely happen, no question. It is a harder problem to do. From my perspective as an engineer, you solve these individual components first, and then you can link them together in a very smart way. Today, there are frameworks out there with which you can already build really powerful, seamless voice agents, like Pipecat or LiveKit. So I don't think there's actually anything wrong with the three-part system. It is maybe perhaps less powerful in some ways, but robustness, reliability, and accuracy will take first place over everything else. Hermes Frangoudis (45:04) That makes sense, and we see it with our own Agora Conversational AI Engine framework, where you can put in Soniox and get that cascading flow. The cascading definitely feels like it has better accuracy, better timing almost, than the all-in-ones, because, as you said, the stream is coming in, it's making the predictions, it's streaming out the text. That little extra bit of streaming and low-latency buffer really pushes the agent forward. So I am curious to see how much better each component can get. Like you said, if you deploy it all together on the same prem, there are certain ways you can link it and get something that feels very, very low-latency and accurate. Klemen Simonic (45:46) Yep, and there's an opportunity here, as the company that owns all three of these components with very, very high accuracy, reliability, robustness, and fidelity, to make voice agents like this come to life.
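The three-component cascade discussed here reduces to a very small skeleton. All three functions below are hypothetical stand-ins rather than any real framework's API; orchestration layers like the Pipecat and LiveKit frameworks mentioned above wire up streaming versions of essentially this loop.

```python
def speech_to_text(audio_chunks):
    """Component 1, streaming STT: yields finalized user utterances."""
    yield "what is my account balance"        # stand-in transcript

def llm_reply(utterance):
    """Component 2, the 'brain in the middle': text in, text out."""
    return f"You asked: {utterance}. Let me check that for you."

def text_to_speech(text):
    """Component 3, TTS: returns synthesized audio for playback."""
    return text.encode("utf-8")               # placeholder for audio bytes

def voice_agent(audio_chunks):
    # Each stage can be hardened (and swapped) independently, which is the
    # reliability argument for the cascade over a single end-to-end model.
    for utterance in speech_to_text(audio_chunks):
        reply = llm_reply(utterance)
        yield text_to_speech(reply)

for audio_out in voice_agent([b"..."]):
    print(len(audio_out), "bytes of reply audio")
```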
Hermes Frangoudis (46:01) Well, Klemen, we're getting to the top of the hour. I know you're a very busy person, so I thank you so much for your time. I have one more question, and this is a bit more of a wild card. If you weren't working on real-time transcription and translation and speech-to-text with Soniox, what other voice AI problem do you think you'd be excited to look into and try to solve? Klemen Simonic (46:12) Okay. I mean, we are solving the whole voice AI problem. Maybe speech-to-text seems like just one part, but it's really more than that. I think we're solving the voice, the speech AI problem, let's call it the voice part of the brain, at Soniox. So maybe, if I were not working on voice and speech, what else would I be working on? Is that what you meant? Yeah. Hermes Frangoudis (46:54) Yeah, yeah, if you weren't working in this space, in voice AI, what are the things that would excite you as an engineer? Klemen Simonic (47:01) I mean, while we are working on voice, we are actually deep into the other parts of LLMs and other parts of this tech. We know quite a lot about other parts of AI as well. I would probably be really interested in more researchy stuff: how to make AI that can explore new things, like self-learning over a longer period of time. I've always been interested in this and did a little bit of research on it at Facebook as well. So, these self-evolving, long-term learning systems that kind of organize themselves in a way that more intelligent behavior emerges than if you just train individual components. Hermes Frangoudis (47:50) That move towards AGI, as some would call it, right? Where the model itself is able to take what it's learned in one area and apply it, teach itself, in another. That is super interesting. Klemen Simonic (48:04) Yeah. Yeah. I think it's still pretty unexplored what can be done there. Hermes Frangoudis (48:10) Well, Klemen, thank you so much for your time, and thank you to everyone watching along. We'll see you in the next one. Make sure to like, subscribe, and follow us for more. Klemen Simonic (48:20) And Hermes, thank you so much for having me. Really cool conversation. I appreciate it. Hermes Frangoudis (48:25) Amazing conversation. Thank you.