Deepgram's VP of Research Andrew Seagraves joins to explore the science and engineering behind modern speech recognition systems. Hermes and Andrew dive deep into why speech recognition isn't a solved problem, the two-stage training process of speech-to-text models, and the challenges of balancing real-time latency with accuracy. The conversation covers Deepgram's origins from dark matter research, power laws in speech data, buffer-based architectures for real-time transcription, and frontier challenges like multilingual code-switching, emotion detection, and conversational dynamics. Andrew shares insights on model deployment, customer use cases from NASA to food ordering, and the future of self-adapting speech models.
Hermes introduces Andrew Seagraves from Deepgram, who shares how the company emerged from dark matter research and evolved into a leader in speech recognition.
Andrew explains how Deepgram's models initially gained traction in call center applications due to domain-specific data and high transcription needs.
Andrew dispels the myth that speech recognition is a solved problem, emphasizing the ongoing challenges with rare words, localized terms, and non-English languages.
The discussion shifts to Deepgram's approach to achieving low-latency, real-time transcription without sacrificing accuracy, including model-size trade-offs.
Andrew walks through the two-stage process of model training—pretraining on web-scale data followed by fine-tuning on high-quality, domain-specific transcripts.
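As a highly simplified sketch of that two-stage recipe (hypothetical code, not Deepgram's training stack), the same model is first pretrained on large, noisy web-scale audio and then fine-tuned on a smaller set of carefully transcribed, domain-specific audio at a lower learning rate:

```python
# Simplified sketch of two-stage speech-to-text training (hypothetical,
# not Deepgram's pipeline): broad pretraining on web-scale audio, then
# fine-tuning on a small set of high-quality, domain-specific transcripts.

class SpeechModel:
    def __init__(self):
        self.steps = 0

    def update(self, audio, transcript, learning_rate):
        # Stand-in for a real gradient step on an (audio, transcript) pair.
        self.steps += 1


def train(model, dataset, learning_rate, epochs):
    for _ in range(epochs):
        for audio, transcript in dataset:
            model.update(audio, transcript, learning_rate)
    return model


# Stage 1: pretrain on huge, weakly labeled web-scale data.
web_scale_data = [(b"<audio bytes>", "rough transcript")] * 1000
model = train(SpeechModel(), web_scale_data, learning_rate=1e-3, epochs=1)

# Stage 2: fine-tune on a small, carefully transcribed domain corpus
# (e.g. call-center or medical audio) at a lower learning rate.
domain_data = [(b"<audio bytes>", "verified transcript")] * 50
model = train(model, domain_data, learning_rate=1e-5, epochs=5)
```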
A deeper dive into Zipf’s law and class imbalance, and how over-represented common words and underrepresented rare terms affect speech model performance.
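As a rough illustration of the Zipf-style imbalance discussed here (an assumed toy example, not Deepgram's pipeline), the snippet below counts word frequencies in a short transcript and shows how a few common words dominate while rare, domain-specific terms sit in the long tail:

```python
# Illustrative sketch of Zipf's law in transcript text (toy example):
# word frequency is roughly proportional to 1 / rank, so common words
# dominate and rare or domain-specific terms are badly underrepresented.
from collections import Counter

transcript = (
    "the patient said the pain started after the procedure and the "
    "doctor noted tachycardia and mild arrhythmia during the follow up"
)

counts = Counter(transcript.split())
total = sum(counts.values())

for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    # Zipf's law predicts freq ≈ C / rank for some constant C.
    print(f"rank {rank:2d}  {word:12s}  count={freq}  share={freq / total:.2%}")
```

In a real training corpus the effect is far more extreme, which is why rare terms like drug names or local place names contribute disproportionately to transcription errors.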
Andrew explains the buffer-based architecture behind real-time transcription, the cost of waiting for accuracy, and forward prediction as a future technique.
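As a minimal sketch of the buffer-based idea, assuming a hypothetical `transcribe(audio)` model call (this is not Deepgram's API), the loop below accumulates small audio chunks and only runs the model once enough context has buffered, which is exactly where the latency-versus-accuracy trade-off comes from:

```python
# Minimal sketch of buffer-based streaming transcription (hypothetical,
# not Deepgram's implementation). Audio arrives in small chunks; the model
# runs on a buffered window so it has enough context to be accurate,
# at the cost of waiting before emitting words.
from typing import Iterable, List

CHUNK_MS = 100    # size of each incoming audio chunk
BUFFER_MS = 2000  # wait for ~2 s of audio before transcribing (latency cost)


def transcribe(audio: bytes) -> str:
    """Placeholder for a speech-to-text model call."""
    return "<partial transcript>"


def stream_transcripts(chunks: Iterable[bytes]) -> Iterable[str]:
    buffer: List[bytes] = []
    buffered_ms = 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered_ms += CHUNK_MS
        if buffered_ms >= BUFFER_MS:
            # Enough context accumulated: run the model and emit an interim result.
            yield transcribe(b"".join(buffer))
            buffer.clear()
            buffered_ms = 0
    if buffer:
        # Flush whatever audio remains at the end of the stream.
        yield transcribe(b"".join(buffer))
```

Shrinking `BUFFER_MS` lowers latency but gives the model less context per pass; the forward-prediction idea mentioned above would instead guess upcoming words before the audio has fully arrived.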
Deepgram’s data-centric approach and small transformer-based models are contrasted with large, slow models. Andrew details their early adoption of encoder-decoder models like Whisper.
Deepgram models are seeing adoption in industries like healthcare, finance, and food ordering. Andrew explains how models handle varied environments but struggle with domain-specific vocabulary.
Andrew shares Deepgram’s strategy to generate synthetic training data for low-resource languages and environments, reducing dependency on human labeling.
From NASA astronauts to noisy drive-throughs, Deepgram adapts models to tough acoustic conditions and shares why real-world audio still breaks open-source models.
Andrew discusses challenges with running speech models at the edge, why Deepgram works well out of the box, and how real-time orchestration differs from batch transcription.
Modeling bilingual and code-switched conversations is a frontier challenge. Andrew explains why it's hard and how localized, group-specific behavior adds complexity.
The future of audio intelligence involves understanding emotion, speaker state, and background audio—requiring new models trained on multi-modal data like video.
Modeling the timing of human conversation—knowing when to stop or start talking—is still unsolved. Andrew likens it to LLM alignment and the social subtleties of interjection.
The episode closes with a look at federated learning and Deepgram’s work on real-time adaptation to user speech, aiming for models that learn from feedback instantly.