99% Isn’t Enough: How Speechmatics Is Fixing ASR
Show Notes
In this episode of the Convo AI World Podcast, host Hermes interviews Ricardo Herreros Symons from Speechmatics to explore the evolving landscape of speech recognition and its role as the essential gateway to "LLM magic". The conversation delves into the technical hurdles that prevent speech from being a truly "solved" problem, such as background noise, diverse accents, and specialized terminology, and why Speechmatics prioritizes diarization as a critical requirement for security and accuracy in high-stakes industries like healthcare and finance.
Key Topics Covered
- How Speechmatics turns audio into machine-readable text and downstream insights
- Why speech recognition remains unsolved in the real world: noise, accents, and domain vocabulary
- Diarization as a meaning layer: speaker identity, security, and correct meeting summaries
- Beyond words: emotion and paralinguistic signals that improve agent understanding
- The "bar" for speech AI: matching what all of humanity can do, not a single speaker
- Runtime custom dictionaries and why they create consistent "wow" moments
- Speech as the gateway to LLM magic: making LLM capabilities accessible via voice
- Latency and evaluation: optimizing for the first correct word, not just the first response
- Ambient scribing in healthcare as a friction-reducer that accelerates adoption
- Regulation and deployment: concerns around certification, privacy, and on-premise constraints
- From pilots to full deployment: iterative POCs embedded into customer workflows
- Why cascaded architectures are safer than direct speech-to-speech for enterprise reliability
Episode Chapters & Transcript
Teaser
Why speech recognition still has a long tail—and how Speechmatics approaches ASR, diarization, and real-world accuracy.
How Speechmatics Started
Ricardo shares the origin story of Speechmatics as a university project and how it scaled from early traction to a global speech platform.
What Does Speechmatics Do?
In plain terms: turn audio into machine-readable, usable text for captions, agent tooling, meeting insights, and more.
The Mission: Ubiquity & Personal Motivation
The goal is everyday ubiquity—paired with a personal motivation for how voice can unlock access when reading and writing aren’t available.
Why Speech Recognition Isn't Solved
It’s not one problem anymore: expectations rise while noise, accents, mumbling, and specialized terminology keep the long tail unsolved.
Beyond Text: Diarization, Emotion & Meaning
Speech is multimodal: who spoke, how they spoke, and the paralinguistic signals behind emotion and meaning matter as much as the words.
The Bar is Humanity, Not a Single Human
The standard isn’t “good for one speaker”—it’s closer to what all of humanity can do, which pushes robustness across languages and dialects.
The Mission to "Understand Every Voice"
Global training data, self-supervised techniques, and an explicit drive to cover accents and voices that traditional datasets miss.
Wow Moments & What Surprises Customers
Customer “magic” moments: difficult terminology, tricky accents, strong diarization, and runtime custom dictionaries that pick up names and product words.
Speech as the Gateway to LLM Magic
Why speech is the interface layer that unlocks LLM capabilities for non-technical users—making “LLM magic” accessible through voice.
The Business Impact of Diarization
Diarization improves summaries and also enables security/authentication flows where the wrong speaker identity would be unacceptable.
Latency: Speed vs. Natural Conversation
Speed matters—but the metric should reflect the time to the first correct word, plus the right pacing to avoid barge-in interruptions.
Surprising Industry Adoption: Healthcare
Ambient scribing is a clear painkiller in time-poor healthcare workflows, driving faster adoption than many expected.
Regulation & Customer Concerns
Ricardo explains how HIPAA/on-premise considerations and device-style certification shape what customers need before deploying in regulated contexts.
From Pilot to Full Deployment
Moving beyond one-off pilots: customers increasingly want smaller commitments, embedded POCs, and deeper integration over time.
How Our Relationship with Voice is Changing
People are shifting from typed input to spoken dictation for LLM-driven workflows—voice becomes a default interface rather than a special feature.
What Excites You Most About Speech AI?
The excitement comes from scale: voice is everywhere, creating a virtuous loop of data, refinement, and broader transformation.
What's Close But Not Quite There Yet?
Speech-to-speech feels close, but enterprise-grade reliability in regulated settings still needs cascaded, controllable architectures and better emotion handling.
How Humans Are Changing to Fit the AI
As people interact with voice/LLM systems, they naturally adjust how they speak—end-of-turn timing and conversational pacing evolve to help the AI understand.