Relatability Over Perfection in Voice AI with Rime's Lily Clifford
Show Notes
In this episode of the Convo AI World Podcast, Hermes Frangoudis interviews Lily Clifford, CEO of Rime AI, about the evolution of voice AI technology. They explore the importance of high-quality data, the impact of linguistic nuances on voice models, and the challenges of creating relatable, multilingual voice agents. Lily shares insights on customer experience, the role of R&D in meeting market demands, and the future of conversational voice agents. The conversation highlights the technical bottlenecks in voice AI and the ongoing quest for more human-like interactions in voice technology.
Key Topics Covered
- From linguistics to voice AI: how language science shapes model design
- The critical role of high-quality data in training effective voice models
- How linguistic nuance impacts model accuracy and user perception
- Why relatable, human-sounding voices outperform polished synthetic ones
- The future of multimodal and multilingual voice AI systems
- Customer experience and emotional intelligence as the next frontier in R&D
Episode Chapters & Transcript
Highlights
Preview clips featuring key insights from the episode about relatability over perfection in voice AI, including why 'bored' voices convert better than polished ones.
Introduction
Hermes welcomes Lily Clifford from Rime AI and introduces the episode topic.
Rime Labs Origin Story
Lily shares the origin story of Rime Labs, from Stanford PhD research in linguistics to building a recording studio in 2020. Discussion covers the importance of speech data, underrepresented groups in datasets, and the early vision for collecting conversational data.
Building Speech Synthesis
The moment of discovery when Rime trained its first text-to-speech model on conversational data and realized it sounded like a friend rather than a voice actor, a breakthrough in natural-sounding speech synthesis.
Computational Linguistics Influence
How Lily's background in computational linguistics influences Rime's product choices, including detailed annotation of ums, ahs, false starts, and other natural speech patterns that make voices more relatable.
Relatability vs Perfection
The concept of relatability in voice AI - going beyond realistic to relatable. Discussion on pronunciation challenges, predictability, controllability, and building models that provide confidence scores and allow immediate fixes without retraining.
Trade-offs in Voice Design
Exploring the trade-offs between expressiveness and controllability in voice models, multimodal approaches using LLMs trained on text, and the future of phonetic representations combined with orthographic text.
Enterprise Deployments and Learnings
Surprising findings from real-world deployments: the best-performing voices are often the 'bored' ones that sound genuinely human, not overly polished. Discussion on customer preferences and the importance of relatability over perfection.
Testing and Developer Tooling
Rime's approach to making voice AI easier to ship, including tools that allow pronunciation fixes without knowing the International Phonetic Alphabet. The challenge of building complex voice applications with multiple LLMs and guardrails.
Multilingual Support Challenges
Overwhelming demand for multilingual support, especially in India. Challenges include data collection, linguistic awareness, code-switching (like Hinglish), and building models for specific dialects like Castilian Spanish and Saudi Arabic.
R&D vs Revenue Priorities
How Rime balances R&D ambition with revenue goals. In a fast-growing market, R&D often means meeting customer demand, with partners helping identify where demand exists.
Trends in Voice AI
Discussion of speech-to-speech models: overhyped on timeline but underestimated in long-term impact. The analogy to databases in the 1970s and '80s: there will be many specialized models for different use cases, not one monolithic solution.
Future of Conversational Voice Agents
What it will take for truly conversational voice agents: the ability to interrupt users naturally, move beyond turn-based interactions, and maintain emotional context throughout conversations.
Technical Bottlenecks
The biggest technical bottleneck: lack of state management in emotional character. How the bot responds should depend on what happened three turns ago, maintaining emotional frame of reference throughout the conversation.
Model Convergence
Future of multimodal models: training LLMs to predict both text tokens and speech simultaneously, creating reasoning models that are drop-in replacements for current LLMs while also generating speech.
Research to Follow
Shoutouts to researchers pushing the envelope: Google Tacotron team (RJ Skerry-Ryan, Julian Salazar), Andrew at OpenAI, and Shivam Mehta at Netflix.
Linguistic Pet Peeves
The challenge of uptalk in American English: statements ending in periods delivered with rising intonation, and questions delivered with falling intonation. The tension between what real people do and the control that voice experiences need.
Closing
Final thoughts on the future of voice interfaces and closing remarks.