Building a Universal Speech Model: Native Accuracy Across 60+ Languages
Show Notes
In this episode of the Convo AI World Podcast, Hermes Frangoudis interviews Klemen Simonic, founder and CEO of Soniox, who discusses how his team is achieving native-speaker accuracy across 60+ languages. Klemen explains how Soniox leverages unsupervised learning and a universal model architecture to handle seamless language switching and real-time, mid-sentence translation with minimal latency. By prioritizing robustness and low-latency performance, Soniox enables high-fidelity voice interfaces for healthcare, wearables, and voice agents, while also breaking down significant accessibility barriers for the hearing-impaired community.
Key Topics Covered
- Chunk-based translation allows for natural, interactive real-time conversations
- The system automatically detects and handles mid-sentence language switching
- Voice interfaces will eventually become as ubiquitous as touch screens are today
- Cascading models currently provide more robustness than end-to-end systems
- Unsupervised learning enables high accuracy for languages with little labeled data
- Voice interfaces are transforming the healthcare and wearable technology industries
Episode Chapters & Transcript
Introduction
Hermes welcomes Klemen Simonic, Founder and CEO of Soniox, and congratulates the team on their recent first-place evaluation results. He introduces the episode's topic: achieving native-speaker accuracy in speech-to-text.
Soniox Origin Story
Klemen shares his journey from working with machine learning in 2008, joining Facebook in 2015 to restart their speech AI team, and building systems that served hundreds of millions of users. He explains how his background in natural language processing transitioned into speech AI.
The Multilingual Accuracy Challenge
Klemen explains the core problem Soniox set out to solve: achieving native-speaker accuracy across 60+ languages, not just English. The challenge of recognizing minority languages with high accuracy despite having very little labeled data compared to popular languages like English.
Unsupervised Learning and AI Data Factory
How Soniox uses unsupervised learning and an AI data factory approach to create high-quality training datasets from petabytes of audio data. The key difference between training text LLMs (same modality) versus speech-to-text (dual modality: audio input, text output).
Universal Model Architecture
Why Soniox chose a universal single model approach over multiple language-specific models. How the model leverages similarities between languages, handles entities across languages, and naturally recognizes multilingual terms like 'apple strudel' or city names regardless of the primary language being spoken.
Real-Time Transcription and Translation
The challenges of real-time streaming speech-to-text with low latency. How Soniox achieves mid-sentence translation with minimal delay (1-2 seconds) instead of waiting until the end of sentences, enabling fluent back-and-forth conversations across languages.
Balancing Latency and Accuracy
How the model makes two predictions: first for transcription accuracy, then for translation timing. The model learns to wait until it can create high-quality translations, especially for languages like German or Korean where word order changes significantly in translation.
Seamless Language Switching
How the universal model handles code-switching and mid-sentence language changes naturally. Examples of Spanglish, Hinglish, and multilingual conversations where speakers mix languages seamlessly. The model follows speakers wherever they go linguistically.
Key Use Cases: Voice Agents, Wearables, Healthcare
The biggest use cases for Soniox: voice agents taking off, smart glasses and wearables needing voice interfaces, and healthcare applications where doctors can dictate notes or ambient AI listens to patient-doctor conversations for accurate transcription.
Breaking Language Barriers
How real-time translation enables multilingual conferences, family gatherings, and everyday situations like Lyft rides. The transformative impact of bidirectional real-time translation breaking down communication barriers that couldn't be overcome 5-6 years ago.
Accessibility: Helping the Hearing-Impaired
The profound impact on people with hearing disabilities: Soniox enables them to read conversations they can't hear. A user's story about gaining independence: his wife no longer needs to stay home, and he can go to the doctor alone. 10% of the population has severe hearing disabilities.
Competitive Landscape and Positioning
How Soniox competes with larger providers like Google. Companies serious about voice applications choose the best provider because accuracy is critical - otherwise the application doesn't work. Soniox's focus on solving hard problems where a focused team can out-compete.
Prioritization: Customer Demand vs. Vision
The hybrid approach to prioritization: balancing customer feedback with intuition and vision. Sometimes customers don't know what they need yet, and you have to build for where the market will be in a year. Other times, customer feedback is essential for fixing reliability issues.
Future of Speech Recognition
Klemen's vision: speech-in, speech-out experiences will become very reliable and robust, enabling conversations with AI like talking to a human within one year. The role of unsupervised learning and AI data factory processes in driving further improvements.
Edge vs. Cloud Deployment
The trade-offs between on-device and cloud models. On-device models are less capable due to compute constraints, but suitable for certain applications. Once users experience the quality of cloud models, on-device models are hard to accept, though compression technology will continue to advance.
Overlapping Speech and Speaker Identification
How Soniox has solved overlapping speech by linearizing it. The remaining challenge: robust speaker separation and identification, which is crucial for wearables like smart glasses to know who is speaking.
Voice Agents: Cascading vs. End-to-End
Why robustness and reliability matter most for voice agents. The three-component cascading approach (STT, LLM, TTS) will persist until end-to-end systems can achieve 99.9% reliability. Frameworks like Pipecat and LiveKit enable powerful voice agents today.
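The three-component cascade can be summarized as a simple pipeline where each stage is an independent, swappable module. This is a minimal sketch of that shape only; every function here is an illustrative placeholder, not a real Soniox, Pipecat, or LiveKit API.

```python
# Minimal sketch of a cascading voice-agent pipeline: STT -> LLM -> TTS.
# Each stage is a placeholder for a call to a real service.

def transcribe(audio_chunk: bytes) -> str:
    """STT stage: would call a streaming speech-to-text service."""
    return "what are your opening hours?"

def generate_reply(transcript: str) -> str:
    """LLM stage: would prompt a language model with the transcript."""
    return f"You asked: '{transcript}' We are open 9am to 5pm."

def synthesize(text: str) -> bytes:
    """TTS stage: would return synthesized speech audio."""
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes) -> bytes:
    # Because the stages are decoupled, each one can be tested, monitored,
    # and replaced independently -- the robustness argument for cascading
    # over end-to-end systems until the latter reach 99.9% reliability.
    transcript = transcribe(audio_chunk)
    reply = generate_reply(transcript)
    return synthesize(reply)
```

The design point is the decoupling itself: a failure or upgrade in any one stage does not require retraining the whole system, which is why this architecture is expected to persist for now.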
Wild Card: Self-Evolving AI Systems
If not working on voice AI, Klemen would explore self-evolving learning systems that organize themselves so that more intelligent behavior emerges. The path toward AGI runs through systems that can learn and adapt over longer periods of time.
Closing
Closing remarks and thanks to Klemen for the conversation.