Speech to Text Masterclass: The Tech Behind the Modern Voice Pipeline

May 14, 2026
00:57:06

Show Notes

In this special montage episode of the Convo AI World Podcast, host Hermes Frangoudis brings together leading researchers and founders from across the speech-to-text space to unpack the voice pipeline from the ground up. The conversation covers how cascading architectures stack up against real-time speech-to-speech systems, why Voice Activity Detection acts as the critical traffic controller, and how enterprises can eliminate model pathologies like hallucinations and omissions through modular “Lego block” integration. Guests from Deepgram, Agora, Soniox, Rime, and other companies share hard-won lessons on achieving near-native accuracy across 60+ languages with self-supervised learning, taming unpredictable pronunciations in LLM-driven agents, and why truly human-like emotional understanding is still around the corner. The episode confronts the persistent myth that speech recognition is a solved problem, spotlighting the long tail of accents, rare words, and noisy real-world conditions that still break most systems. It also makes the case that, for regulated, high-stakes industries, the auditable text backbone of cascading pipelines remains essential even as speech-to-speech models race toward a more natural future.

Key Topics Covered

  • Speech-to-text accuracy still struggles with accents, rare words, and real-world noise.
  • Voice Activity Detection (VAD) reduces latency and cost in speech-to-text pipelines.
  • Self-supervised learning enables accurate speech recognition across 60+ languages.
  • Phonetic control ensures correct text-to-speech pronunciation for brands and medical terms.
  • Modular speech-to-text architectures let developers mix ASR, LLM, and TTS providers.
  • Cascading pipelines give enterprises an auditable text backbone for speech AI.

Episode Chapters & Transcript

0:00:00

Introduction: A speech-to-text masterclass

Hermes introduces this montage episode on how machines turn audio into text for LLMs and AI systems, featuring researchers and founders across the STT stack.

0:00:32

Cascading ASR–LLM–TTS vs speech-to-speech (Ben Weekes)

Agora's Ben Weekes breaks down the voice pipeline (STT, LLMs, and TTS), contrasts cascading stacks with voice-to-voice models, and explains transparency, provider mix-and-match, and auditing trade-offs. (Ep. 001)

0:05:13

Why speech recognition is not solved (Speechmatics)

Speechmatics' Ricardo Herreros Symons pushes back on solved-problem narratives: the growing long tail, real-world failure modes, accents, multilingual trade-offs, and efficiency at scale. (Ep. 024)

0:07:57

Two-stage training, Whisper, and real-time pathologies (Deepgram)

Deepgram's Andrew Seagraves walks through pre-training vs post-training, why Whisper is only stage one, model pathologies like hallucinations and omissions, and why streaming ASR is harder than batch. (Ep. 005)

0:14:10

Agora orchestration as Lego-block integrations (Colaberry)

Colaberry's Ram Katamaraja shows enterprise SIP/WebRTC plumbing, plugging in Whisper and vendor STT/TTS backends, dial-in flows, and upgrading legacy call centers with agentic stacks. (Ep. 012)

0:20:22

Phonetic control and the path to 100% accuracy (Rime)

Rime's Lily Clifford on predictable pronunciation with phonetic annotations, why unpredictable LLM-driven agents break pre-QA, and the enterprise thesis that teams need a path to full accuracy. (Ep. 013)

0:24:01

Voice Activity Detection as traffic controller (TEN VAD)

TEN Framework's Ziyi Lin explains how VAD detects sentence start/stop, triggers the STT–LLM–TTS chain, handles interruptions, and how edge deployment saves latency, bandwidth, and STT cost. (Ep. 007)

0:27:27

Voice as a health biomarker (Thymia)

Thymia's Emilia Molimpakis on voice's power for mental and physical health signals, ubiquity and privacy advantages over video, isolating condition signatures from demographic noise, and multilingual coverage. (Ep. 020)

0:30:15

The biggest misconceptions about speech recognition (Deepgram)

Andrew Seagraves dispels the myth that ASR is solved—English strengths vs multilingual gaps, rare and localized words, and why word accuracy remains an underappreciated core challenge. (Ep. 005)

0:32:07

Native accuracy across 60+ languages (Soniox)

Soniox's Klemen Simonic on building real-world multilingual ASR, leveling accuracy across languages with self-supervised pre-training when labeled data is scarce, and native-speaker parity at scale. (Ep. 021)

0:36:01

Speech-to-speech vs regulated cascading (Speechmatics)

Ricardo Herreros Symons on speech-to-speech feeling close but not enterprise-ready, why text guardrails matter in regulated industries, emotion gaps, and hybrid model-switching futures. (Ep. 024)

0:38:04

Balancing real-time latency and accuracy (Deepgram)

Andrew Seagraves on real-time vs batch transcription, constrained design spaces for speed and scale, and how higher-quality localized data enables smaller, more accurate models. (Ep. 005)

0:40:38

The dialect and locale long tail (Rime)

Lily Clifford on sparse high-quality voices—Castilian Spanish, Arabic dialects, Saudi vs Moroccan intelligibility—and why colloquial, locale-faithful data collection never stops. (Ep. 013)

0:43:50

Deepgram's origins: dark matter to ASR (Deepgram)

Andrew Seagraves traces Deepgram from dark-matter physicist founders and personal audio corpora through YouTube-scale audio search to early call-center traction and end-to-end deep learning conviction. (Ep. 005)

0:47:05

Expressiveness vs controllability in TTS (Rime)

Lily Clifford on the trade-off between highly expressive and controllable voice models, LLM-backed multimodal TTS, phonetic post-training, and frontier work marrying richness with control. (Ep. 013)

Tags

#speech-to-text, #asr, #automatic speech recognition, #voice activity detection, #vad, #multilingual asr, #self-supervised learning, #phonetic control, #text-to-speech, #tts, #cascading pipeline, #speech-to-speech, #enterprise voice ai, #conversational ai, #voice ai, #real-time transcription, #model hallucinations, #audio pipelines, #lego block integration, #regulated industries