Deepgram's VP of Research Andrew Seagraves joins to explore the science and engineering behind modern speech recognition systems. Hermes and Andrew dive deep into why speech recognition isn't a solved problem, the two-stage training process of speech-to-text models, and the challenges of balancing real-time latency with accuracy. The conversation covers Deepgram's origins from dark matter research, power laws in speech data, buffer-based architectures for real-time transcription, and frontier challenges like multilingual code-switching, emotion detection, and conversational dynamics. Andrew shares insights on model deployment, customer use cases from NASA to food ordering, and the future of self-adapting speech models.
Hermes introduces Andrew Seagraves from Deepgram, who shares how the company emerged from dark matter research and evolved into a leader in speech recognition.
Andrew explains how Deepgram's models initially gained traction in call center applications due to domain-specific data and high transcription needs.
Andrew dispels the myth that speech recognition is a solved problem, emphasizing the ongoing challenges with rare words, localized terms, and non-English languages.
The discussion shifts to Deepgram's approach to achieving low-latency, real-time transcription without sacrificing accuracy, including model-size trade-offs.
Andrew walks through the two-stage process of model training—pretraining on web-scale data followed by fine-tuning on high-quality, domain-specific transcripts.
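As a highly simplified sketch of that two-stage recipe (hypothetical code, not Deepgram's training stack), the same model is first pretrained on large, noisy web-scale audio and then fine-tuned on a smaller set of carefully transcribed, domain-specific audio at a lower learning rate:

```python
# Simplified sketch of two-stage speech-to-text training (hypothetical,
# not Deepgram's pipeline): broad pretraining on web-scale audio, then
# fine-tuning on a small set of high-quality, domain-specific transcripts.

class SpeechModel:
    def __init__(self):
        self.steps = 0

    def update(self, audio, transcript, learning_rate):
        # Stand-in for a real gradient step on an (audio, transcript) pair.
        self.steps += 1


def train(model, dataset, learning_rate, epochs):
    for _ in range(epochs):
        for audio, transcript in dataset:
            model.update(audio, transcript, learning_rate)
    return model


# Stage 1: pretrain on huge, weakly labeled web-scale data.
web_scale_data = [(b"<audio bytes>", "rough transcript")] * 1000
model = train(SpeechModel(), web_scale_data, learning_rate=1e-3, epochs=1)

# Stage 2: fine-tune on a small, carefully transcribed domain corpus
# (e.g. call-center or medical audio) at a lower learning rate.
domain_data = [(b"<audio bytes>", "verified transcript")] * 50
model = train(model, domain_data, learning_rate=1e-5, epochs=5)
```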
A deeper dive into Zipf’s law and class imbalance, and how over-represented common words and underrepresented rare terms affect speech model performance.
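As a rough illustration of the Zipf-style imbalance discussed here (an assumed toy example, not Deepgram's pipeline), the snippet below counts word frequencies in a short transcript and shows how a few common words dominate while rare, domain-specific terms sit in the long tail:

```python
# Illustrative sketch of Zipf's law in transcript text (toy example):
# word frequency is roughly proportional to 1 / rank, so common words
# dominate and rare or domain-specific terms are badly underrepresented.
from collections import Counter

transcript = (
    "the patient said the pain started after the procedure and the "
    "doctor noted tachycardia and mild arrhythmia during the follow up"
)

counts = Counter(transcript.split())
total = sum(counts.values())

for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    # Zipf's law predicts freq ≈ C / rank for some constant C.
    print(f"rank {rank:2d}  {word:12s}  count={freq}  share={freq / total:.2%}")
```

In a real training corpus the effect is far more extreme, which is why rare terms like drug names or local place names contribute disproportionately to transcription errors.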
Andrew explains the buffer-based architecture behind real-time transcription, the cost of waiting for accuracy, and forward prediction as a future technique.
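As a minimal sketch of the buffer-based idea, assuming a hypothetical `transcribe(audio)` model call (this is not Deepgram's API), the loop below accumulates small audio chunks and only runs the model once enough context has buffered, which is exactly where the latency-versus-accuracy trade-off comes from:

```python
# Minimal sketch of buffer-based streaming transcription (hypothetical,
# not Deepgram's implementation). Audio arrives in small chunks; the model
# runs on a buffered window so it has enough context to be accurate,
# at the cost of waiting before emitting words.
from typing import Iterable, List

CHUNK_MS = 100    # size of each incoming audio chunk
BUFFER_MS = 2000  # wait for ~2 s of audio before transcribing (latency cost)


def transcribe(audio: bytes) -> str:
    """Placeholder for a speech-to-text model call."""
    return "<partial transcript>"


def stream_transcripts(chunks: Iterable[bytes]) -> Iterable[str]:
    buffer: List[bytes] = []
    buffered_ms = 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered_ms += CHUNK_MS
        if buffered_ms >= BUFFER_MS:
            # Enough context accumulated: run the model and emit an interim result.
            yield transcribe(b"".join(buffer))
            buffer.clear()
            buffered_ms = 0
    if buffer:
        # Flush whatever audio remains at the end of the stream.
        yield transcribe(b"".join(buffer))
```

Shrinking `BUFFER_MS` lowers latency but gives the model less context per pass; the forward-prediction idea mentioned above would instead guess upcoming words before the audio has fully arrived.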
Deepgram’s data-centric approach and small transformer-based models are contrasted with large, slow models. Andrew details their early adoption of encoder-decoder models like Whisper.
Deepgram models are seeing adoption in industries like healthcare, finance, and food ordering. Andrew explains how models handle varied environments but struggle with domain-specific vocabulary.
Andrew shares Deepgram’s strategy to generate synthetic training data for low-resource languages and environments, reducing dependency on human labeling.
From NASA astronauts to noisy drive-throughs, Deepgram adapts models to tough acoustic conditions and shares why real-world audio still breaks open-source models.
Andrew discusses challenges with running speech models at the edge, why Deepgram works well out of the box, and how real-time orchestration differs from batch transcription.
Modeling bilingual and code-switched conversations is a frontier challenge. Andrew explains why it's hard and how localized, group-specific behavior adds complexity.
The future of audio intelligence involves understanding emotion, speaker state, and background audio—requiring new models trained on multi-modal data like video.
Modeling the timing of human conversation—knowing when to stop or start talking—is still unsolved. Andrew likens it to LLM alignment and the social subtleties of interjection.
The episode closes with a look at federated learning and Deepgram’s work on real-time adaptation to user speech, aiming for models that learn from feedback instantly.