Speech to Text Masterclass: The Tech Behind the Modern Voice Pipeline
Show Notes
In this special montage episode of the Convo AI World Podcast, host Hermes Frangoudis brings together leading researchers and founders from across the speech-to-text space to unpack the voice pipeline from the ground up. The conversation covers how cascading architectures stack up against real-time speech-to-speech systems, why Voice Activity Detection acts as the critical traffic controller, how two-stage training curbs model pathologies like hallucinations and omissions, and how modular “Lego block” integration lets enterprises mix and match providers. Guests from Deepgram, Agora, Soniox, and Rime share hard-won lessons on achieving near-native accuracy across 60+ languages with self-supervised learning, taming unpredictable pronunciations in LLM-driven agents, and why truly human-like emotional understanding is still just around the corner. The episode confronts the persistent myth that speech recognition is a solved problem, spotlighting the long tail of accents, rare words, and noisy real-world conditions that still break most systems, and makes the case that for regulated, high-stakes industries the auditable text backbone of cascading pipelines remains essential even as speech-to-speech models race toward a more natural future.
Key Topics Covered
- Speech-to-text accuracy still struggles with accents, rare words, and real-world noise.
- Voice Activity Detection (VAD) reduces latency and cost in speech-to-text pipelines.
- Self-supervised learning enables accurate speech recognition across 60+ languages.
- Phonetic control ensures correct text-to-speech pronunciation for brands and medical terms.
- Modular speech-to-text architectures let developers mix ASR, LLM, and TTS providers.
- Cascading pipelines give enterprises an auditable text backbone for speech AI.
Episode Chapters & Transcript
Cascading ASR–LLM–TTS vs speech-to-speech
Hermes frames how conversational pipelines translate speech to text and back, contrasts cascading stacks with voice-to-voice models, and notes transparency and auditing trade-offs.
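To make the contrast concrete, here is a minimal sketch of one cascading turn, with each stage behind a plain function. The transcribe/complete/synthesize helpers are illustrative stand-ins, not any vendor's actual API.

```python
# Stand-ins for a cascading ASR -> LLM -> TTS turn. None of these functions
# are a real vendor API; they exist to show the shape of the data flow.

def transcribe(audio: bytes) -> str:
    """Stand-in for a streaming STT call."""
    return "what's the weather like today"

def complete(transcript: str, history: list[str]) -> str:
    """Stand-in for an LLM chat-completion call."""
    return f"Placeholder answer to: {transcript!r}"

def synthesize(text: str) -> bytes:
    """Stand-in for a TTS call that returns playable audio."""
    return text.encode("utf-8")  # pretend this is PCM audio

def handle_turn(audio: bytes, history: list[str]) -> bytes:
    # Every hop produces inspectable text, which is the transparency
    # and auditing argument for cascading stacks.
    transcript = transcribe(audio)
    history.append(transcript)
    reply = complete(transcript, history)
    history.append(reply)
    return synthesize(reply)

print(handle_turn(b"\x00\x01fake-pcm", history=[]))
```

A speech-to-speech model collapses all three hops into one, which is exactly why it leaves no text trail to audit.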
Deepgram's roots and early ASR bets
Deepgram's guest traces audio-search origins, dark-matter physicist founders, early call-center traction, and why deep learning plus data pointed toward scalable transcription.
Why speech recognition is not a solved problem
The episode pushes back on solved-problem narratives, weighing English strengths against multilingual gaps, rare words, and acoustic diversity, and connects those constraints to real-time versus batch workloads.
Two-stage training, Whisper-style pre-training, and pathologies
Guests unpack large-scale pre-training versus narrow post-training, why Whisper-like stage-one alone breeds hallucinations and omissions, and why streaming ASR has less context.
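One visible symptom of the stage-one pathology is a decoder that loops, repeating the same phrase until its token budget runs out. A simple post-hoc heuristic can flag such transcripts for re-decoding or review; the n-gram size and repeat threshold below are illustrative assumptions, not values from the episode.

```python
def looks_like_loop(transcript: str, n: int = 3, max_repeats: int = 4) -> bool:
    """Flag transcripts where one n-gram repeats pathologically often."""
    words = transcript.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return False
    counts: dict[tuple[str, ...], int] = {}
    for gram in ngrams:
        counts[gram] = counts.get(gram, 0) + 1
    return max(counts.values()) >= max_repeats

print(looks_like_loop("thank you thank you thank you thank you thank you"))  # True
print(looks_like_loop("thanks for calling, how can I help you today"))       # False
```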
Voice Activity Detection as traffic controller
VAD drives sentence start/stop, finalizes transcripts for the LLM, and enables interruptions; running tiny VAD models on-device saves latency, bandwidth, and STT spend.
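As a sketch of that traffic-controller role, the snippet below gates audio with the open-source webrtcvad package, buffering speech frames and finalizing an utterance after a run of silence. The frame size and silence threshold are illustrative choices, not values discussed in the episode.

```python
import webrtcvad

SAMPLE_RATE = 16000                                # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                                      # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) .. 3 (strict)

def endpoint(frames, silence_frames_to_finalize=10):
    """Buffer speech frames; yield one finalized utterance per pause.

    Each yielded chunk is what gets sent to STT, and its transcript is
    what gets finalized for the LLM.
    """
    speech, silence_run = [], 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            speech.append(frame)
            silence_run = 0
        elif speech:
            silence_run += 1
            if silence_run >= silence_frames_to_finalize:  # ~300 ms of silence
                yield b"".join(speech)
                speech, silence_run = [], 0
    if speech:
        yield b"".join(speech)

if __name__ == "__main__":
    silence = [b"\x00" * FRAME_BYTES] * 20  # all-zero frames: no speech detected
    print(list(endpoint(silence)))          # -> []
```

Forwarding only the yielded chunks to a hosted STT service, rather than the raw stream, is what saves bandwidth and per-minute transcription spend.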
Phonetics, predictable TTS, and the dialect long tail
Rime discusses enterprise pronunciation control with phonetic annotations, unpredictable LLM-driven agents, expressiveness versus controllability, and sparse high-quality voices across locales.
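Rime's own annotation format isn't shown here, but the idea can be sketched with the generic SSML `<phoneme>` tag: a lexicon of known brand and medical terms (the entries below are hypothetical) is applied before text reaches the TTS engine, so pronunciation never depends on the model's guess.

```python
import re
from html import escape

# Hypothetical brand/medical lexicon mapping spellings to IPA.
LEXICON = {
    "Agora": "əˈɡɔːɹə",
    "adalimumab": "ˌædəlɪˈmjuːməb",
}

def annotate(text: str) -> str:
    """Wrap known terms in SSML phoneme tags so the TTS engine can't guess."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        ipa = LEXICON.get(word)
        if ipa is None:
            return escape(word)
        return f'<phoneme alphabet="ipa" ph="{ipa}">{escape(word)}</phoneme>'
    return re.sub(r"[A-Za-z]+", repl, text)

print(annotate("Agora prescribed adalimumab twice weekly."))
```

The same lookup is harder to guarantee in LLM-driven agents, since the model may emit spellings the lexicon has never seen.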
Agora orchestration as Lego-block integrations
An Agora-focused segment walks through SIP/WebRTC plumbing, plugging Whisper or vendor STT/TTS backends into the pipeline, dial-in versus in-app entry points, and upgrading legacy call flows with agentic stacks.
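The "Lego block" idea reduces to a small interface per stage, so orchestration code never depends on a specific backend. The classes below are illustrative stand-ins, not Agora's actual SDK surface.

```python
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class WhisperSTT:
    def transcribe(self, audio: bytes) -> str:
        return "caller asked about billing"  # stand-in for a local Whisper run

class VendorTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()                 # stand-in for a hosted TTS call

def run_pipeline(audio: bytes, stt: STT, tts: TTS) -> bytes:
    # Orchestration sees only the interfaces, so backends swap freely,
    # whether audio arrives over SIP dial-in or a WebRTC app session.
    transcript = stt.transcribe(audio)
    reply = f"Routing you now. You said: {transcript}"
    return tts.synthesize(reply)

print(run_pipeline(b"pcm...", stt=WhisperSTT(), tts=VendorTTS()))
```

Swapping WhisperSTT for a hosted vendor client is then a one-line change at the call site, which is what makes upgrading legacy call flows tractable.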
Multilingual parity, data factories, and regulated cascading
Soniox covers self-supervised paths across sixty-plus languages and petabyte-scale synthetic data workflows; closing voices weigh the cost of long-tail realism, argue for hybrid speech-to-speech systems with cascaded guardrails, and note the emotional nuance still missing.