Hermes Frangoudis (00:00) This is not the thing that presents well in the boardroom, but on the ground, this is what converts and works.

Lily Clifford (00:05) I promise you that if you put this voice into production, 20% of people will be saying thank you, and right now 0% of people are saying thank you. That's true even though you would not think it was.

Hermes Frangoudis (00:15) And that's what I want to highlight in this. So it's the boring voices, the ones that aren't overly acting but sound the most human, right? That's the advice: don't be afraid to try those out and see how they can convert.

Lily Clifford (00:27) Yeah. And with that said, every use case is different. I think there are many use cases in which the polished voice wins, you know what I mean? So the takeaway isn't necessarily that the bored Gen Z voice always converts better; it's that people should have an open mind.

Hermes Frangoudis (00:48) This is the Convo AI World Podcast, where we interview the founders and builders pushing the voice AI space forward. And today I am so thrilled to have a legend, Lily Clifford from Rime AI, Rime Labs. Thank you so much for joining me today, Lily.

Lily Clifford (01:04) Thanks Hermes, I'm excited.

Hermes Frangoudis (01:05) Okay. So let's take it back to the origin story. What really spawned the idea for Rime Labs? What motivated you to leave Stanford and start Rime?

Lily Clifford (01:17) Yeah, back in 2018 I started my PhD at Stanford in the Linguistics Department. At the time, I actually didn't join with the intention of working on deep learning speech processing; I was an acoustic phonetician. But if you're at Stanford in the Linguistics Department and you're working on speech data, it's, you know, the advent of transformer-based approaches to speech processing. I think it was 2019 when wav2vec 1.0 and then wav2vec 2.0 came out, these large self-supervised approaches to doing speech recognition out of Facebook. And I was hanging out a lot with my cohort mate, Nay San, who is a speech technologist and is now on our team. It was a crazy time to be working in speech whatsoever. And I started doing research on voice biometrics problems. So you train a neural network to classify someone's age or gender, or their likelihood of having, for example, early-onset Parkinson's. You actually can train a neural network to reasonably accurately classify whether someone is going to develop Parkinson's, for example.

Hermes Frangoudis (02:36) I've heard about that kind of stuff. That's the technology where, using the speech, it can detect the little impediments and issues that are early signs of what's coming.

Lily Clifford (02:45) Yeah, it's a classification problem, just like any other classification problem. If you have the data, you can train a classification system to classify; that's all it is. And those signals are not perceptible to us, of course, but they are detectable. And at the same time, if you train a neural network to classify someone's gender, for example, even just from two to three seconds of them talking, it's like 99.8% accurate. But then you have it generate predictions for real-world data, having been trained on all this open-source speech data, and the open-source speech data that exists is essentially people volunteering to read audiobooks online.
So you can imagine, the kind of people who volunteer to read audiobooks online are a very particular kind of people.

Hermes Frangoudis (03:32) Interesting crowd, right?

Lily Clifford (03:37) Yeah, exactly, exactly. And that's the basis of all speech research, by the way: these open-source datasets of this very particular crowd.

Hermes Frangoudis (03:41) I keep hearing about these open-source datasets and how much work and TLC they need.

Lily Clifford (03:47) The TLC, yes. Some people just take it as ground-truth data, of course. People have spent a lot of time labeling these datasets for use in research. And at the same time, right, they're a very particular kind of people. So you have the model generate predictions for real-world data, and you find these systems do less well on African-American English speakers, for example, because those are the people who are not in these datasets.

Hermes Frangoudis (04:10) So, underrepresented groups. It all comes back to the data. That's been a theme.

Lily Clifford (04:14) It all comes back to the data. And so when we started the company, I was talking a lot with my friend and now co-founder, Brooke Larson, and our mutual friend, Ares Geovanos. And we were like, data is so important to building these models. So we started by building a recording studio, all still working our day jobs. It was kind of a crazy thing to do. It was also 2020, and all of us were like, "what are we doing with our lives?" I think a lot of people were going through that at the time.

Hermes Frangoudis (04:43) Oh yeah, very much so. A lot of technology, a lot of time, a lot of isolation.

Lily Clifford (04:48) Yes, and literally spending 24 hours a day with each other. Brooke and Ari were roommates at the time, and they were my friend group; those are the people I was spending time with. One night we were literally like, we should build a recording studio. We could see how important speech data was going to be. And by the way, at the time, the term "Frontier Lab" didn't exist. Now you see these companies who are selling data to Frontier Labs; that didn't exist back then. But we knew how important this data was going to be. There was some thought that we could train our own models, but we could also sell this data to, again, what we would now describe as Frontier Labs. And by the way, Anthropic didn't even exist. OpenAI had not released ChatGPT.

Hermes Frangoudis (05:31) Yeah, this was before a lot of this came out onto the market. This stuff is only, what, two or three years old consumer-facing? It's relatively new. So we're talking way before this.

Lily Clifford (05:42) Way before, way before that. Back then there also wasn't much investment in data collection. Because as models increase in capability, there is an ever-increasing need for higher-quality data. Everyone has access to the same public data, these crappy open-source datasets, of course, but then also all the data you can find everywhere. But if I can find data, then someone somewhere else can also find data. We had this idea that if we collected a large proprietary dataset of, not a voice actor reading an audiobook, but someone having a conversation with a friend or family member, that would be really valuable.
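A quick technical aside on the biometrics discussion above: the failure mode Lily describes, a speaker-attribute classifier that looks near-perfect in aggregate but degrades on groups missing from its training data, only becomes visible if you score accuracy per dialect group. Here is a minimal sketch of that evaluation pattern; the embeddings, labels, and group tags below are synthetic stand-ins, not Rime's data or pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: one fixed-length embedding per utterance (in practice
# these might come from a wav2vec-style encoder), a binary attribute label,
# and a dialect tag with one group heavily underrepresented.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 256))
y = rng.integers(0, 2, size=2000)
group = rng.choice(["dialect_a", "dialect_b"], size=2000, p=[0.9, 0.1])

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The aggregate number hides per-group failures; report both.
print("overall accuracy:", clf.score(X_te, y_te))
for g in ("dialect_a", "dialect_b"):
    mask = g_te == g
    print(g, clf.score(X_te[mask], y_te[mask]))
```

With random features both groups score near chance; on real data the interesting signal is the gap between the two per-group numbers, which the aggregate accuracy conceals.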
We didn't really know how it would be valuable. We didn't know why it might be valuable. We just knew that that data didn't exist at the time. So quite literally, we were working our day jobs and then also collecting 10 hours of data every day, manning this recording studio, and working with consultants who had advised the largest recording studios in the Bay Area on how to build vocal booths, except we were doing it for $5,000 and not a million dollars. And they probably thought we were insane, because we were. They're like, who are these people building DIY vocal booths in the basement of a building in SoMa?

Hermes Frangoudis (06:53) That's the pioneer in you, though. It's like, whatever it takes to get this off the ground, right?

Lily Clifford (06:59) Yeah. And again, we had no idea how valuable it might be, or if it had value, or if anyone was going to care. It was fun though, and it remains fun. We're still collecting data as we speak.

Hermes Frangoudis (07:11) You still have the booth?

Lily Clifford (07:26) Yeah. It's not 24/7, but it's all the time, every day. And we have recording studios in many places in the world now too. We haven't built those ourselves; we contract out recording studios. Point being, the idea at the time was: if we were to train models, what would this data enable us to do? And the way I often think about it is, okay, not only did I not know what a Frontier Lab was, because it didn't exist, but I also didn't know anything about enterprise voice AI. I knew nothing about this. I didn't know what an IVR system was. I didn't know about Agora. I didn't know about WebRTC. I didn't know about SIP. I didn't even know the acronym IVR, by the way. For six months of collecting data, I had never heard the acronym IVR. For people listening, that's Interactive Voice Response, the phone tree you get when you call a business, right?

Hermes Frangoudis (07:53) Yep, exactly. So when you call up and it's like "press one for this," "press your party's extension." That's all IVR phone systems, the maze.

Lily Clifford (08:04) Yes, the maze. And we all hate the maze.

Hermes Frangoudis (08:11) So you're collecting all this voice data. What was that moment where you're like, "hey, we could do something with this, we can make speech synthesis"? How'd you get there?

Lily Clifford (08:27) I don't really remember. I think at the time it was like, hey, we're collecting all this data, we need to annotate it. We spun up in-house annotation; at first it was us literally typing out what people were saying. And we started training models on top of it. At first we trained a speech recognition model to help us in our efforts to transcribe the data we were collecting, which I had experience doing. I had never had experience training text-to-speech models, but...

Hermes Frangoudis (08:46) Okay.

Lily Clifford (08:50) At that time, there was also sort of a renaissance in open-source models for speech, for the first time ever, really. So we were just tinkering, expert tinkering at the end of the day. And you train a text-to-speech model on this data and you're like, wow, I've never heard anything like this before.
Like, this sounds like my friend, not like a voice actor.

Hermes Frangoudis (09:02) I'm getting chills. It sounds like one of those moments where you're all sitting around and you're like, wait, that sounds real. That doesn't sound like a robot.

Lily Clifford (09:12) Hahaha. And by the way, this is at a time when, if you look at the hyperscalers, Google, Microsoft, and Amazon, they all have one African-American English voice. Like, one.

Hermes Frangoudis (09:28) Yeah, there's no flavor to anything.

Lily Clifford (09:29) No flavor whatsoever. And I still say this to this day: these products are Wonder Bread, right? They do the job, they're calories.

Hermes Frangoudis (09:41) They're also made to sit on the shelf for long periods of time, I guess. So how does your background in computational linguistics influence the product choices you made early on, in differentiating from the Wonder Bread?

Lily Clifford (09:44) Correct, correct. I mean, I think if you care really deeply about how language is used, and that's what linguists do, care deeply about how language is used, then you're going to pay attention to things that people who are just building Wonder Bread don't care about. And what I mean by that is, even down to how we annotate data. When I'm talking with you, I'm saying "um" and "ah" and stuff, right? That stuff needs to be transcribed. Of course it needs to be in the labeled data if you want to train a speech synthesis model that does those things with any degree of fidelity to how people use language. But at the same time, the number of times I start saying a word and then stop saying a word, or have a false start, and the very particular ways people have false starts, which I wouldn't necessarily describe as stuttering, but which have really rich, deep meaning: when I go "err," that has this sense of "I'm about to say something." You need to label that, right? And that data has never been labeled, because people are dealing with scripted content. You have a voice actor, they read lines, you know what they'll say in advance. That's it. Then you have these pairs of the script line and the audio, and you train a text-to-speech model on top of that. Obviously that works, and it's really easy.

Hermes Frangoudis (11:06) Very vanilla though, right? I know what's being said: cookie-cutter, factory Wonder Bread, pop-pop-pop, in and out, right? But you don't get emotion or any of that stuff, do you?

Lily Clifford (11:21) I'd say you get emotion, but you get the emotion that's present in scripted content, which by the way is really important, and these models can reproduce it with a higher degree of fidelity than they ever have. If you're building a voice agent and you want it to be able to say "I'm so sorry to hear that," "we can definitely help out with that," you need that data, right? The sympathy needs to be there for sure. But just to answer your question: it's so multifaceted, from the kind of data you collect to how you label it.
The computational linguistics background comes in from the fact that we know that if we want to model something, it has to be labeled. And then we use our linguistic expertise to label that data appropriately, in a way that no one else does.

Hermes Frangoudis (11:59) So you have these really highly detailed annotations, properly catching the ums, the ahs, the stutters, the false starts, all these little nuanced things that someone without this background might not consider worth labeling, or that don't get labeled in the open-source sets, like you said.

Lily Clifford (12:17) Yeah. And again, we didn't know that it would have value to an end user of a model. We knew that maybe a Frontier Lab would find it valuable, but then we started training models and we're like, people really like how this sounds. It's very simple, right? People like how it sounds.

Hermes Frangoudis (12:23) Mm-hmm. Yeah, it sounds more realistic, it sounds more human. People tend to prefer it.

Lily Clifford (12:36) I often think the axis people forget about is: we often say "realistic" or "human-like," but we very rarely say "relatable."

Hermes Frangoudis (12:45) I like that word, relatable.

Lily Clifford (12:47) Because speech models are really good. I mean, they're really good. They have problems, of course. There are these problems with hallucination, and with consistency and accuracy of pronunciation of really critical words in business contexts, prescription names, et cetera. There's this long tail of problems. And at the same time, speech models are really good. They sound really good. But then the next question is: is it relatable? Is this someone I would want to talk to?

Hermes Frangoudis (13:09) Yep, that's another one. How much do I relate to this voice, connect with it, feel that it doesn't feel cold, I guess. I don't know what the description would be there, but that's an interesting point. Sorry, you had me thinking there. You mentioned, within relatability, the ability to pronounce certain words. And I've noticed different voice models will pronounce or mispronounce the same word. Even when it's a word I'd think they should all pronounce the same, it's funny to watch some stutter on it or completely mispronounce it, versus others that hit it on the head.

Lily Clifford (13:48) If you throw a list of extremely challenging brand names at any text-to-speech model, you're going to find that it pronounces some of them right and some of them wrong, and that set is different for each model. At the same time, we're often selling to teams of enterprise developers, and those teams are building really high-volume calling applications for, in many cases, Fortune 100 businesses. And you can imagine: it's one thing to mispronounce a word; it's another thing to not have predictability over whether a word is going to be pronounced correctly or not. A lot of people build what I would describe as general-purpose speech models, models that are really good at reading things out. But what you lose from having less richly annotated data on the phonetic level, right?
I'm talking about having a phonetic transcription system for the data, and that's what you train the text-to-speech model on, not just how the words are spelled, how you and I would spell them. What you lose without that is predictability and control. And what I mean by that is, if you send Häagen-Dazs to our API, maybe, I'm not sure, maybe we pronounced it incorrectly today. But we want to build models where we can tell you, A, that we are less confident we're going to pronounce it correctly, and B, if we're not pronouncing it correctly, you can fix it immediately without us having to retrain the model.

Hermes Frangoudis (15:11) So being able to pass the pronunciation annotation along with it, if the model isn't confident enough.

Lily Clifford (15:19) Yes, exactly that. And at the same time, this is a feature that has existed in text-to-speech models since the very beginning of text-to-speech models. But no one else has built workflows around telling you, "here are the words we're not pronouncing correctly today." Otherwise, you just have to call it yourself and guess. And by the way, this wasn't a problem pre-LLM, because with the IVR maze, everything is basically pre-generated. You can run QA, and once you run QA, that's it: everything sounds good. If we're not pronouncing Häagen-Dazs correctly in the phone tree, you pass the International Phonetic Alphabet transcription, you use the text-to-speech model to create the audio, done. But now, in the era of LLMs, you don't even know what the voice agent is going to say before it says it.

Hermes Frangoudis (15:59) It's completely unpredictable. So the ability to pre-QA it has gone out the door, right? So it's more about: how do you pronounce it predictably? How do you ensure there's confidence, or flag when there isn't?

Lily Clifford (16:03) Correct. And really, the Rime thesis is: if you don't have a path to 100% accuracy, then enterprise won't adopt. I'm not saying you have to have 100% accuracy, because we're never going to have 100% accuracy. But if you don't show the teams and developers who are building voice agents a path to 100% accuracy, they can't really build a product.

Hermes Frangoudis (16:30) They're not gonna feel comfortable. Yeah.

Lily Clifford (16:40) Right. And so if you're, you know, Providence Medical and you're building a voice agent for genetic counseling screening, you have no idea what the patient on the other end of the call is going to say. Maybe they have a family history of cystic fibrosis, and you never thought about cystic fibrosis before in your life. And then the voice agent mispronounces cystic fibrosis, and you're like, this is the opposite of an empathetic clinical experience. Do you know what I mean?

Hermes Frangoudis (16:59) Yeah, it goes down the drain real quick right there.

Lily Clifford (17:00) Don't go down the drain.
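The "fix it without retraining" workflow Lily describes can be pictured as a thin pronunciation layer sitting in front of the TTS call: a human-reviewed lexicon supplies overrides, and unreviewed rare words get flagged back to the developer. The sketch below is a hypothetical illustration of that idea, not Rime's actual API; real systems typically express overrides as SSML `<phoneme>` tags or vendor-specific markup.

```python
# Hypothetical pronunciation layer in front of a TTS request. The lexicon
# holds human-reviewed IPA overrides; anything capitalized and unknown is
# surfaced for review instead of silently shipping a guess.
KNOWN_LEXICON = {
    "häagen-dazs": "ˈhɑːɡən dæs",  # word (lowercased) -> reviewed IPA
}

COMMON_WORDS = {"thanks", "for", "calling", "your", "refill", "is", "ready"}

def prepare_utterance(text: str) -> tuple[str, list[str]]:
    """Return (markup, flagged_words) for one utterance about to be spoken."""
    out, flagged = [], []
    for token in text.split():
        key = token.lower().strip(".,!?")
        if key in KNOWN_LEXICON:
            # Inline override so the fix needs no model retraining.
            # (Punctuation handling is omitted for brevity.)
            out.append(
                f'<phoneme alphabet="ipa" ph="{KNOWN_LEXICON[key]}">{token}</phoneme>')
        else:
            if key not in COMMON_WORDS and token[:1].isupper():
                flagged.append(token)  # low-confidence candidate for review
            out.append(token)
    return " ".join(out), flagged

markup, review = prepare_utterance(
    "Thanks for calling Häagen-Dazs, your Xolair refill is ready.")
print(markup)
print("flag for pronunciation review:", review)
```

Here "Xolair" gets flagged because it looks like a rare proper noun with no reviewed entry yet; a production system would derive that confidence signal from the model itself rather than from a capitalization heuristic.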
Hermes Frangoudis (16:59) That's super interesting. So in terms of these nuances in the linguistics, the accents, intonations, the emotional variance: is that captured in the model differently than in a regular TTS or ASR system? Or do you follow standard methods, and it's more about the richness of the data and how good you are with it?

Lily Clifford (17:00) It's not rocket science: garbage in, garbage out.

Hermes Frangoudis (17:26) Okay. No, no, it's a common refrain I hear.

Lily Clifford (17:52) But it's worth restating. What makes us different is not necessarily that we're rocket scientists; it's that we have linguistic expertise, we value that linguistic expertise, and as a result we foreground it in the data that we collect and annotate. At the end of the day, that's what we're focused on. And if you're building a general-purpose voice model, you might not care, and that's okay. That's what I think is most exciting about modeling right now. Take Sora 2.0: did they build Sora 2.0 with the intention that if you, Hermes, add your likeness to Sora 2.0, it's going to pronounce your last name correctly? No, I don't think so.

Hermes Frangoudis (18:13) Not at all. And there are so many cases where it just doesn't, right?

Lily Clifford (18:15) Yeah. But I don't think that's what's important about Sora 2. Sad to say; it would be great if it could pronounce your last name.

Hermes Frangoudis (18:23) I'm happy that it gets my first name; the last name is a different story.

Lily Clifford (18:28) Anyway, you get what I'm saying, right? That's what's exciting: there are different models for different use cases.

Hermes Frangoudis (18:30) Yeah. And I've definitely seen it. Depending on what kind of agent you're trying to build or what kind of task you're trying to accomplish, the voice that you're using and the model that you're using all play into that pipeline, right? It's like selecting the right tool for the job.

Lily Clifford (18:47) And customer experience... I would say we're training voice models for customer experience. And again, this is not a term I knew three years ago, "customer experience." But when you look up the definition, it says something like: the sum total of impressions that a customer or consumer has of your brand. So, different tools for different jobs.

Hermes Frangoudis (19:04) Yeah. And you want them to have a good experience no matter what kind of job they're trying to do.

Lily Clifford (19:09) Exactly. Exactly.

Hermes Frangoudis (19:17) And it all comes down to meeting the customer, meeting their expectation, which is really cool to hear. So what are some of the trade-offs between more expressive tones and speech patterns versus recognizability when you're designing this stuff? Is it purely, "hey, we're going to have these different flavors, we're going to give them different annotations of data," the richness, and it all comes out the way it does? Or is there something you do in that sauce too?

Lily Clifford (19:38) There are still a lot of missing pieces, I would say, in having voice models that are both highly expressive and also controllable. There's still such a trade-off there today, and it doesn't have to be that way. I would just say, again, people are focusing on different things at different points in history, and we're at a moment right now where people have sacrificed controllability for expressiveness. And again, it's not going to remain that way for very long.
Rime will train those. You know, the most expressive speech synthesis models today are trained on top of large language models that have only seen text, really. It's strange but true. You can take a large language model that saw 25 trillion tokens of text, just text, and then you start showing it text and audio, and it learns text and audio. It's crazy, but true.

Hermes Frangoudis (20:14) Interesting. So it becomes kind of multimodal.

Lily Clifford (20:29) It is, yes, exactly. Which, by the way, no one would ever have described a text-to-speech model as multimodal before. You predict audio given text, so in that way text-to-speech is the OG multimodal; it's never not been multimodal, is what I'm saying. But to your point, yes: definitionally, in the way people talk about large language models, it becomes multimodal.

Hermes Frangoudis (20:48) Yes, okay.

Lily Clifford (20:49) The benefit of that is it saw 25 trillion tokens of language, so it basically understands something about language.

Hermes Frangoudis (20:55) It's able to pick up those little nuanced patterns in the written language, which can then somehow help reinforce them in the audio.

Lily Clifford (21:00) Correct. Crazy but true. And at the same time, those 25 trillion tokens of text data are not phonemic or phonetic representations, right? They're just how you and I would spell words. So there's still work that needs to be done to post-train the text-based LLM, still on text, but phonemic text. What you're seeing now, and this is essentially frontier research, is how to take this

Hermes Frangoudis (21:15) Yeah, yeah.

Lily Clifford (21:32) LLM backbone and post-train it to learn phonetic representations in concert with the orthographic, written forms of the language, and then post-train it again to become multimodal. That's really the cutting edge. That's how you get a high level of nuance and richness in the spoken language while also getting the controllability that comes with the phonetic representation. So that's the future.

Hermes Frangoudis (21:55) That's amazing. And that's the bleeding edge right now, the frontier that's really pushing the space forward.

Lily Clifford (22:01) Yeah, it's basically Google and Rime that are doing that.
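At a very high level, the recipe Lily outlines, taking a text-only decoder and post-training it on interleaved text and discrete audio tokens, can be sketched in a few lines. This is a toy illustration of the general idea, assuming codec-style audio tokens and a single shared next-token objective; it is not Rime's or Google's architecture, and every dimension below is made up.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, D = 32000, 1024, 512  # illustrative sizes only

class MultimodalDecoder(nn.Module):
    """Toy decoder over a joint vocabulary of text and audio-codec tokens."""
    def __init__(self):
        super().__init__()
        # Audio tokens occupy ids [TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB).
        self.embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, TEXT_VOCAB + AUDIO_VOCAB)

    def forward(self, ids):
        # Causal mask: each position may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.backbone(self.embed(ids), mask=mask))

model = MultimodalDecoder()
# A post-training batch interleaves text with the audio tokens that realize
# it, so one next-token loss teaches both modalities at once.
batch = torch.randint(0, TEXT_VOCAB + AUDIO_VOCAB, (2, 16))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, TEXT_VOCAB + AUDIO_VOCAB), batch[:, 1:].reshape(-1))
loss.backward()
```

The phonemic post-training Lily mentions would slot in before this step: the same backbone first learns phoneme sequences alongside spellings, so pronunciation stays controllable once audio tokens are added.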
Hermes Frangoudis (22:02) That's awesome. Earlier, we were talking about the flavors, and there's only one flavor, right, the Wonder Bread. How does Rime approach modeling for underrepresented dialects or speech patterns? Things that are not that common, but exist enough that, when you're an enterprise, you need to account for these edge cases?

Lily Clifford (22:24) We describe it as a long tail, right? But you would be shocked at how long-tail it is. You have one of the most prominent agent builders for customer experience in 2025 saying they can't get high-quality Castilian Spanish voices, like Spain Spanish voices.

Hermes Frangoudis (22:49) Yeah.

Lily Clifford (22:50) Like, literally almost impossible. Latin American Spanish is there, right? But why is that? And they have no idea why. They tried this model, they tried that model, they put it in front of, you know, a large enterprise in Spain, and they're told: this sounds like someone from Colombia. You know what I mean? And it's such a moving target, too, because languages change all the time.

Hermes Frangoudis (22:53) And you guys got it.

Lily Clifford (23:11) And the colloquial variety of a language changes all the time. And as these AI agents become more capable, the expectation that they remain colloquial, that they sound like people talk today, will be ever-increasing. So you have to collect that data, and you have to train models that are purpose-built for fidelity to how people expect people to talk. And by the way, there do exist Castilian Spanish models from four or five years ago; people just don't like them anymore. Taste is ever-changing. And with that said, there is essentially no high-quality Hindi text-to-speech model in existence today. There is essentially no high-quality Arabic text-to-speech model in production. Why is there no high-quality Arabic text-to-speech model? Because people built Arabic text-to-speech models for what's called Modern Standard Arabic, which is a form of Arabic that no one speaks natively. It's what every audiobook would be narrated in. It would be as if the only English that existed in datasets was Shakespearean English, and we were reading Shakespeare all the time. And we're talking right now, and we're obviously not talking like Shakespeare.

Hermes Frangoudis (24:05) Interesting. I mean, we're not even talking traditional English right now; we're talking an American dialect of English, right?

Lily Clifford (24:23) Yeah. So the first-order problem is just having, you know, generic Saudi Arabic.

Hermes Frangoudis (24:31) And then moving down into the different regional varieties.

Lily Clifford (24:32) And then moving down, right. Because we've proven that if you talk to a fast-casual restaurant voice agent in Atlanta and you hear an African-American English voice, the likelihood that you complete that order increases.

Hermes Frangoudis (24:45) Just because of the relatability.

Lily Clifford (24:47) The relatability, exactly. So is there anything like that happening in Saudi Arabia? No, not at all, because they don't even have Saudi Arabic models to begin with.

Hermes Frangoudis (24:55) I wonder if they're doing it in Greece and Cyprus, where my family's from. Those are very different dialects of Greek. It's funny when you see videos on Instagram of how something is pronounced in one versus the other; you don't even realize how different it is.

Lily Clifford (25:09) I know, yeah. People don't realize that someone in Saudi Arabia couldn't even understand someone in Morocco.

Hermes Frangoudis (25:13) Yeah, because they're just completely different dialects of Arabic.

Lily Clifford (25:15) It is. It's like Spanish and Italian.

Hermes Frangoudis (25:17) It's wild. You guys at Rime obviously power a lot of real-time solutions for enterprise businesses, major brands; I don't know if you can name-drop some. What are some of the things you've learned from these deployments that surprised you, similar to that Atlanta fact?

Lily Clifford (25:35) There are a lot of things that have surprised us.
I was on this panel at VAPI Con yesterday about emotion in voice, and, this is just my pet peeve, if you're selling a voice agent to the chief marketing officer of a Fortune 100 company, they want it to be expressive, right? They want it to be really polished, perfect.

Hermes Frangoudis (25:54) Yes.

Lily Clifford (25:54) And then what we've found in production, when you're able to measure call success, in fast food, for example: if someone completes an order and you get an upsell, that's a pretty successful call, right?

Hermes Frangoudis (26:05) Yep, exactly. They spent their money. Success.

Lily Clifford (26:07) Yeah, success, 100% success. And then they said thank you at the end, right? So, 110% success. What we find is that the best-performing voices are not those polished, perfect voices. They're...

Hermes Frangoudis (26:19) The ones that sound human.

Lily Clifford (26:20) Yeah, and what does it mean to sound human? Sometimes it means you sound really bored. You're not an actor. I'm not an actor. I mean, sometimes I'm acting right now, on podcasts, right? If I was just talking to you normally, it would sound pretty much like this. But if I was reading something out, I'd be like, "Thank you so much for calling Domino's Pizza! What can I do for you today?" Whereas if I was actually working at Domino's, I'd be like, "Hey, thanks for calling. What can I get for you today?"

Hermes Frangoudis (26:36) Yeah, the inflection would be different. You wouldn't sound overly excited. You'd sound, like you said, mildly bored, but trying to put on some sort of excitement to fit the script.

Lily Clifford (26:56) What we've found is that the downbeat voices are the ones people feel most comfortable talking with.

Hermes Frangoudis (27:01) Probably because it doesn't come across as phony, yeah.

Lily Clifford (27:03) It's not phony. And I've come to hear phony so easily now.

Hermes Frangoudis (27:10) It's true, because you hear a lot of voice models and they're like, "Hello!"

Lily Clifford (27:16) Yeah. And some of our customers do want a higher level of controllability. You know, someone's building a training simulation for a contact center, and sometimes they want it to sound angry, right? They want control over that.

Hermes Frangoudis (27:26) They want it to sound forced.

Lily Clifford (27:31) Yeah, 100%. But at the end of the day, customer-facing customer experience, I hate to say it, but it's not rocket science. You want it to be a good experience. And part of being a good experience is not sounding so forced and phony, I think.

Hermes Frangoudis (27:41) Yes. This is super interesting; my brain's going a million directions. So let's get back to some of the things we had planned to talk about.

Lily Clifford (27:51) Awesome.

Hermes Frangoudis (27:55) In terms of call centers, drive-throughs, phone ordering systems: these are all real-world chaos, right? The real world is not neat, not the cookie-cutter stuff. So how do you test for that? Does that fall more on the customers, or is there a way you've developed at Rime Labs to do this?
Lily Clifford (28:19) The big thesis is making it easier to ship applications in voice. And making it easier to ship applications in voice means building capabilities into the model, the ones we've been talking about, that allow you to customize the model, and also tooling to make that way easier, such that you don't even need to know the International Phonetic Alphabet. Because, dirty secret, not so dirty: I don't know if Agora has a linguist on the team, you know what I mean? And by the way, no one does. So if I go to a customer and say, "just use the International Phonetic Alphabet, it's easy," they're like, "I actually don't know how to do that." So having a tool where you can just record yourself saying the word, and then it's live in production, pronounced correctly, without you having to know the International Phonetic Alphabet: that's the kind of thing we like building.

Hermes Frangoudis (29:00) Okay, so the tooling around getting people in there and lowering the barriers.

Lily Clifford (29:05) Lowering the barriers, right. Making it easier to pave a path toward 100% accuracy. Because the status quo, I'm not joking, is this: say you're a customer of Agora, you're the VP of customer experience at a contact center somewhere, and Agora has this developer platform for helping you build these experiences globally. What is the status quo? The status quo is the VP of customer experience calling the system themselves and complaining that the brand name is being pronounced incorrectly. And that's not great, right?

Hermes Frangoudis (29:37) Yeah, that's not what you want to be hearing.

Lily Clifford (29:41) No. And we don't have control over the full application; we're just doing the voice that the customer hears. And building a successful voice application is extremely difficult. The number of moving pieces now is innumerable. People are fine-tuning 15, 20, 30 large language models, depending on how complex these calls are, and each of those language models is handling a very particular part of the call, plus the handoffs and the tool calls. These are enormous problems people face. And I don't think anyone has fully figured out how to do evaluations. I mean, people have figured out how to do evaluations, but it's always a moving target, because capabilities increase and the application surface area increases. It's the total Wild West. We just want to do our part to make it easier.

Hermes Frangoudis (30:22) I love that: just doing their part to make it easier. That's a pioneer-builder mindset. I love it.

Lily Clifford (30:29) The developer product and developer go-to-market remain underappreciated, even if you're selling to enterprises.

Hermes Frangoudis (30:36) That developer experience makes a huge difference in adoption, and in how you get those internal champions to build with your tools, really push forward, and support you as a provider: here's real feedback, here are the challenges we're facing, so you can help them be successful. You mentioned how some of the best voices in terms of conversion and success are the ones that sound very human, right? They're bored, they have these inflections. How has that been from a customer perspective?
Say you go into a meeting with some senior executives, and they want to do these voices and they're very set on a style, and you're like, actually, let's tone it back. We're going to do these simpler, more bored-sounding, human voices. Do they go "no way," or are they all in, like, "we trust you"? Do they push back?

Lily Clifford (31:22) Most of the time they do push back, yeah. And it depends on the go-to-market, it depends who we're selling to. We sell to so many platforms, right? An agent-building platform that just does fast food, or an agent-building platform that just does customer support for a particular kind of business, you name it. And when you're one of these platforms, you're competing against every other platform. So it's often the chief marketing officer's opinion about which voice that wins, where the "end customer" isn't the person who's calling, which in my mind is the real end customer, but the company that's buying the voice agent. The platforms don't have a lot of room for my whole narrative about how you need to A/B test voices and such; they're in demos all the time, right?

Hermes Frangoudis (31:56) Yeah, exactly. And you're in a meeting going, "hi, thanks for calling."

Lily Clifford (32:15) Right. I'm like, "oh no, you should actually go with this voice," and they're like, "F you, I wouldn't get the deal if I put that voice in front of them." So there's that, and I totally recognize that dynamic. I just always try to educate people, and people are generally like, "yeah, I know, but I have to put this demo in front of someone." But if we're selling directly to an enterprise that's building in-house, or they're already in production somewhere and have trust with the end customer, that's where people like to experiment.

Hermes Frangoudis (32:41) That's where you get that opportunity to be like, try this and watch how much better it makes the experience, even though it feels almost counterintuitive, right? This is not the thing that presents well in the boardroom, but on the ground, this is what converts and works best.

Lily Clifford (32:54) I promise you that if you put this voice into production, 20% of people will be saying thank you, and right now 0% of people are saying thank you. Even though you would not think that was true.

Hermes Frangoudis (33:02) That's huge. And that's what I want to highlight, for anyone listening: it's the boring voices, the ones that aren't overly acting, that sound the most human. That's the advice: don't be afraid to try those out and see how they can convert.

Lily Clifford (33:18) Yeah, and with that said, every use case is different. I think there are many use cases in which the polished voice wins, you know what I mean? So the takeaway isn't that the bored Gen Z voice always converts better; it's that people should have an open mind. That's all I'm trying to say.

Hermes Frangoudis (33:29) Okay, got you. Be open to considering all the different parts of that spectrum, I guess you would say, of emotional range and relatability and tone.
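The A/B discipline Lily keeps pushing is straightforward to operationalize once call outcomes like order completion or an unprompted "thank you" are logged. Here's a toy two-proportion z-test over invented counts (the 0% versus 20% thank-you rates from the conversation), just to show the shape of the comparison; it is not a Rime tool.

```python
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Compare conversion rates of voice A vs. voice B on logged calls."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

# Invented counts: voice A is the polished "announcer" read, voice B the
# flatter, more casual read; "success" here is the caller saying thank you.
print(two_proportion_z(success_a=0, n_a=500, success_b=100, n_b=500))
```

In practice you would run the same test per metric (completion, upsell, thank-you rate) and per deployment, since, as Lily notes, which voice wins is use-case dependent.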
Lily Clifford (33:44) There are so many times, and it's unfortunate, but it's true: we have this really strong palette of voices on our platform today, voices you would never hear anywhere else. And sometimes people come to us and say, "this doesn't sound like customer support to me." And really what they mean is, "this doesn't sound like someone from Iowa." You know what I mean? That's my pet peeve more than anything: I guarantee you, if you put up a voice that sounds like someone who has lived in Oakland their whole life, it would be a truly magical experience.

Hermes Frangoudis (34:07) Fair enough. That's super cool. Aside from the variations in tone and regionality, you talked about how no one has support for Arabic, or for certain types of Spanish, like Castilian. What kind of interest are you seeing from enterprise customers around multilingual support, and what are the challenges? Is it all data collection, or are there other challenges as well?

Lily Clifford (34:49) The demand is honestly overwhelming, totally overwhelming. It's why we need to grow the team. You look at web traffic to our website, and we've done no paid marketing anywhere, let alone in India, and three times as many people in India are looking at information about how to build voice agents as in the United States, Canada, and Europe combined. And the data is a huge part of it. It comes with an understanding of what people actually want to send to the model.

Hermes Frangoudis (34:52) That's a good spot to be in.

Lily Clifford (35:15) At the end of the day, it's one thing to train a Hindi text-to-speech model. It's another thing to train a Hindi text-to-speech model for customer support, where every other word is literally in English: account number, inquiry, billing, right? You have to build for those use cases. So data collection is probably the biggest part, but I would also say the linguistic awareness of what it takes to build in some of these languages is non-trivial. Like Arabic.

Hermes Frangoudis (35:19) Mm-hmm.

Lily Clifford (35:39) And by the way, Rime is going to have extremely high-quality Saudi Arabic text-to-speech models that run self-hosted, so the demand in Saudi Arabia can be met. Because, by the way, they have really strict data residency requirements; they can't even use a cloud provider. Anyway, I digress.

Hermes Frangoudis (35:52) No, everything has to be in-country.

Lily Clifford (35:57) Right. But the point being, we're going to train not just a usable but an amazing Saudi Arabic text-to-speech model before we even have language models that can generate text in Arabic that sounds like someone from Saudi Arabia. Sad, but true.

Hermes Frangoudis (36:10) Wild. And when you talk about that Hindi model, or the Indian call centers, you brought up a very interesting topic around swapping between languages. In certain cultures and certain languages, that's just how people speak, right? And it's a nuance of their own voice that varies from family to family.
So yeah, they could be speaking Hindi, and then the account numbers are in English, certain words are in English. It's that mix where it can't just be a monolingual model; the model would always have to be at least bilingual to sustain that kind of switching.

Lily Clifford (36:50) And those are the design decisions we often have to make. If you train one model for every language, sad but true, you get a model that's not very good for these use cases. And there are a lot of reasons beyond that, around controllability: you don't want it to randomly start sounding Spanish when you don't want it to, all this stuff. So the design decisions we often have to make are: okay, we know that shipping monolingual models results in higher quality, but what languages can we package together to make something work? How can we have a Hinglish model? How can we have a Spanglish model? And of course, over time, I'm not saying the end-to-end approach isn't going to bear a lot of fruit, but today you just have to be very strategic about it.

Hermes Frangoudis (37:29) So it's about making that conscious decision.

Lily Clifford (37:31) You don't necessarily need a Hindi model that can also speak Mandarin Chinese, right? It's not there yet. But you definitely need a Hindi model that can say "customer support" in English 100% of the time.

Hermes Frangoudis (37:35) A hundred percent. So as a startup, how do you prioritize R&D ambition against revenue goals? You're the CEO, you're pushing the frontier forward, but you've got to make money; it's a business. Is it mostly R&D, pushing the envelope, or is it about meeting customer needs?

Lily Clifford (38:03) I mean, we're really lucky to be in such a fast-growing market, and one that is so innovative everywhere around us, that R&D means meeting customer demand. Do you know what I mean?

Hermes Frangoudis (38:11) That's awesome. So you're getting paid to do the cool stuff.

Lily Clifford (38:14) We're not in a position of, "you know what we need to do to stay ahead? The thing our customers aren't even asking us for yet." That would be an interesting position to be in.

Hermes Frangoudis (38:24) The industry is not there. The demand just isn't being met at its current state, right?

Lily Clifford (38:28) No, literally. We're also really lucky that we have partners who can help us figure out where the demand is. We wouldn't know that there's demand for X, Y, or Z language, with A, B, and C capabilities built into the model, deployable in D, E, and F ways, unless our existing customers were like, "hey, we really need this." In that sense, we're lucky.

Hermes Frangoudis (38:51) You have that approach to the business: this is the customer need we're satisfying. We're not building what people aren't asking for, because there's enough business in just building what they are asking for.

Lily Clifford (39:03) There's a little bit of a mix of both. When we build these speech-customization features, these aren't things that exist on any other platform. Not to be too Steve Jobs about it, but sometimes you do have to build something people don't know they want.
But it's because you see their problems firsthand. So in that sense, it's still very customer-driven. It's just that they're not asking for it, because they don't know it's possible.

Hermes Frangoudis (39:08) Okay. I've heard this said before around customers, and I don't know if it was Jobs or Henry Ford: "If I had asked my customers what they wanted, they'd have told me a faster horse." It's about listening to what they want, but then building the thing that actually solves their problem.

Lily Clifford (39:33) Right, correct. I mean, most of the time they ask for what they want and they're right.

Hermes Frangoudis (39:40) Yeah, most people know how to articulate it. But like you said, some don't.

Lily Clifford (39:45) Most of the time, when someone's like, "I want a faster horse," you're like, okay, got it, we can get you a faster horse. We'll get you the faster horse.

Hermes Frangoudis (39:45) Yeah, we'll get you the faster horse. But you're also building the car around it. You're like, actually, you need this thing called an automobile.

Lily Clifford (39:54) Correct, correct, yes. And the middle ground is, we're going to have this thing called a carriage.

Hermes Frangoudis (40:02) Yeah, exactly. Everyone's covered. In terms of the broader voice AI space and the frontier, it's very much bleeding-edge. Is there any trend right now that you feel is overhyped, or any trend you feel is super underrated?

Lily Clifford (40:17) I think it's always got to be a mix of both. When I think about speech-to-speech models, I think people underestimate the time horizon over which this stuff will play out. There's this idea that the speech-to-speech model is about to disrupt everything. People ask me literally all the time, "are you threatened?"

Hermes Frangoudis (40:33) By the all-in-one voice model.

Lily Clifford (40:36) Right, that's what I'm saying. And am I threatened? No, because we'll be building them. Why would I be threatened? It's just not going to be this year, because they're not ready.

Hermes Frangoudis (40:42) The tech isn't there.

Lily Clifford (40:43) No. And by the way, the distribution channels we have with our existing customers allow us to build these models in a very particular way. Speech-to-speech all-in-one is here for consumer applications, and not that many of them, by the way. Basically only Grok voice mode and ChatGPT Advanced Voice Mode. Those are the two consumer speech-to-speech models that exist today.

Hermes Frangoudis (41:06) And they still have really, really high inference times compared to the overall streaming pipeline, when you put it all together.

Lily Clifford (41:16) So I just think there are going to be different tools, and those tools are going to be developed on different time horizons. And that's exciting to me. So I'm not threatened. Rime is going to be the platform where enterprise developers who are building these highly accurate applications get the highly accurate speech-to-speech models they need. So it's all at once overhyped, and people don't realize it is the future; they just don't realize what the future looks like. The details of the future, I don't think people realize.

Hermes Frangoudis (41:24) Developed at different speeds. I got you.
It's overhyped in the sense that people are pushing it forward on the consumer side, but the long tail is not as commonly understood.

Lily Clifford (41:52) I think the parallel would be: imagine it's back in the '70s and '80s, and there's this new computing primitive gaining traction in enterprise use cases called a database. And people go, "well, we've got databases solved. Done. There's going to be one database company, no more innovation, and this one kind of database is the perfect database; it's going to disrupt every other computing paradigm we have for storing data." And then you look at how, because of how important databases are, there are 10,000 different approaches and 10,000 different companies worth billions of dollars built around them, built around particular primitives. If you're building a system where someone can order something in Europe and you need to maintain inventory for that one item so that someone can't also buy it in the United States, you're going to want a different database primitive, right? And I think models are the same way.

Hermes Frangoudis (42:18) Yep, yeah.

Lily Clifford (42:43) It would be like being in the '70s and '80s and saying, "yeah, databases, that's done." So the answer would be: yes, speech-to-speech is overhyped, but in the way that people don't realize how rich it's going to be in the future.

Hermes Frangoudis (42:46) Yeah, it makes sense. It's at the point where things are changing so much that it's worth investing in the differences. So how far do you think we are from truly conversational voice agents that feel human? On the scripted side, they're good, and in different cases there's a lot of playing around with it. Do you think there's a point where it becomes a bit more frictionless, less trial and error, and the models just kind of adapt?

Lily Clifford (43:24) Hmm. There are kind of two sides to that question. Because if you swap an enterprise phone tree, an IVR system, for speech recognition, LLM, and TTS with no guardrails, just a single prompt, you're going to find it's three times worse than the IVR system at helping users accomplish what they actually want to accomplish. We see it over and over again: someone prototypes with this very simple architecture and finds it doesn't work. So to the extent that people in enterprises are building really highly conversational voice experiences, where people are accomplishing a lot and have a high level of positivity about the customer experience, the most frictionless applications are in many ways the most complex. People are fine-tuning language models, they have all these guardrails, they have a deep sense of what someone's trying to do, and they build for that.

Hermes Frangoudis (44:12) Gotcha.

Lily Clifford (44:12) But in that sense, it's not really frictionless. If you're building for food ordering, it's for food ordering; it's not for having a conversation. So, I don't know, the future I think is one in which you call Domino's and you can order, but you can also ask a bunch of questions. You can also be like, by the way, "did you watch the game?" And that feels normal.
Right now, that's not where things are at. It's just not. And so that's, I think, what it would take. And I often think, on the speech-modeling side,

Hermes Frangoudis (44:20) Yes, makes sense.

Lily Clifford (44:37) I'm not sure if enterprises will like this, but it's not going to feel truly conversational until the voice agent can interrupt you.

Hermes Frangoudis (44:46) Yes.

Lily Clifford (44:47) Which doesn't exist today.

Hermes Frangoudis (44:48) And people go on long rants where someone needs to butt in and be like, wait, did I understand this part correctly? That sort of thing.

Lily Clifford (44:55) I mean, today any voice agent, and sad to say, even the most sophisticated voice agents, no matter how lifelike and frictionless, is basically a form. You say something, I say something. You say something, I say something. I say, "I didn't get that, can you say it again?" You say it again. There's not a lot of back and forth. I mean, it's back and forth, but it's completely turn-based, you know what I mean? So I think that's the future too: the bot will be able to interrupt you and be like, "no, but what I meant is..."

Hermes Frangoudis (45:20) "Yeah, no, no, sir, sorry."

Lily Clifford (45:22) "No, actually, you don't even need to tell me that. I totally got that already. No worries."

Hermes Frangoudis (45:25) Here's another interesting one, more on the frontier side of it. If you could wave a magic wand and instantly fix one technical bottleneck in voice AI, what would it be?

Lily Clifford (45:38) One thing that's true, and people don't realize this, about the monolithic speech-to-speech models as well as the cascaded approach with different models, is that there's not a lot of state management in the emotional character of how you're interacting. In the cascaded system it's particularly obvious: someone says something, and then three turns later they say something else, and there's no way for the bot's inflection and prosody to respond to it. There's no reference to three turns ago, no reference whatsoever. There are different ways of tackling this. I think we have a thesis on it; I think everyone who's building voice models for customer experience has a thesis on it. The people building speech-to-speech models just say it's already solved, even though it isn't, because the context windows of these speech-to-speech models are actually very narrow to begin with. Point being, I think the frontier is when the way you say "I'm so sorry to hear that" depends on the thing the user said five turns ago.

Hermes Frangoudis (46:33) Yes. So keeping that emotional context and awareness throughout the conversation, not feeling like it's just a piece of a pipeline that things pass through.

Lily Clifford (46:44) Yes, correct.

Hermes Frangoudis (46:45) Interesting.

Lily Clifford (46:45) And it is just how it is right now, right? Someone has an exclamation point at the end of a sentence, and they come to the voice model provider saying it sounds too excited. And I'm like, well, there is an exclamation point there. You know what I mean? It's not the most helpful thing to tell a customer: "try a period?"
Hermes Frangoudis (46:57) I'm supposed to be excited. Yeah, no, that's very interesting. Because as you say these things, I'm thinking back to all these experiences of talking with AI models. Basically every day I sit and we demo voice agents and we build voice agents, and I totally hear every piece of this. I'm like, man, I just never could put my finger on it, but that's exactly what it's doing. It's just not keeping that context, or not having that emotional awareness in the same way that a person could continue to have that frame of reference, right? Lily Clifford (47:25) Yeah, there's very little emotional frame of reference, exactly. Hermes Frangoudis (47:35) So you talked earlier about how interesting it is to take an existing LLM, put audio on it, and it kind of learns. Do you see some sort of eventual convergence of these models, or do you think they're always going to be separate? Like TTS will be its own piece, the LLM its own, or like you said, is it eventually all going to converge to that all-in-one voice-to-voice? Lily Clifford (47:58) Here's the one main issue, by the way, with multimodality: we haven't figured out, as much as people like to say that we have, how to make a multimodal model that retains the reasoning capabilities of the underlying text model. And that's a problem. And by the way, I'm not trying to knock anyone's work, because I played with Sesame and the Sesame demo is so incredible. So incredible. The character of the speech was so natural and engaging. And you could also tell it's kind of saying the same four things over and over again. Hermes Frangoudis (48:13) Yeah, I've definitely seen that. You start to see the pattern. It makes it feel real at the beginning, but not in the long term. It doesn't hold. Lily Clifford (48:34) Yeah. And I think a lot of that has to do with the fact that, in order to get that really rich-sounding speech, it needed to become kind of dumber. Sad, but true. And so I can imagine a future in which the underlying reasoning capabilities are so strong that we can take a hit to reasoning and it still feels like a powerful system. But definitely I can see us, in the very near term, training large language models again on these text backbones, Hermes Frangoudis (48:44) Mm-hmm. Lily Clifford (49:02) where they're trained in parallel to predict text tokens and the speech that represents those text tokens. And so essentially what you have is a reasoning model that is a drop-in replacement for your current LLM in a particular part of your call flow, and that is also generating speech at the same time. And therefore you wouldn't even have a text-to-speech model anymore. Hermes Frangoudis (49:18) Yeah, you just have that one model that kind of works on both pieces. Lily Clifford (49:20) You have that one model. And people are doing this from both sides, right? There are speech models that take speech in and produce text out. This is truly the Wild West: everyone's racing towards this future where there are truly multimodal models, but people haven't figured out what the intermediary steps are that are useful. Because I don't want to build something that's not useful. You know what I mean? Hermes Frangoudis (49:42) Waste of time.
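The "trained in parallel to predict text tokens and the speech that represents those text tokens" idea can be sketched as one shared decoder trunk with two output heads: one for text tokens (the drop-in LLM) and one for audio-codec tokens (replacing the TTS stage). This toy PyTorch module illustrates the concept only; the dimensions, vocabulary sizes, and architecture are made up and are not Rime's or anyone's published design.

```python
import torch
import torch.nn as nn

class TextPlusSpeechLM(nn.Module):
    """Toy sketch: one backbone, two heads, trained on aligned text + speech tokens."""
    def __init__(self, text_vocab=32000, speech_vocab=1024, d_model=1024):
        super().__init__()
        # Shared embedding over a combined text + speech-codec vocabulary.
        self.embed = nn.Embedding(text_vocab + speech_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=8)  # shared "reasoning" trunk
        self.text_head = nn.Linear(d_model, text_vocab)      # drop-in LLM output
        self.speech_head = nn.Linear(d_model, speech_vocab)  # codec tokens, replacing TTS

    def forward(self, tokens):
        # Causal mask so each position only attends to the past.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.text_head(h), self.speech_head(h)

# Training would sum a next-text-token loss and a next-speech-token loss on
# aligned data, so the same trunk learns to reason and to speak.
```

The design choice the sketch highlights is Lily's point: the reasoning trunk and the speech output share parameters, which is exactly where the "it got dumber to sound richer" trade-off she describes comes from.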
Are there any researchers in the space that you're really looking at and going, the kind of stuff they're doing is really cool? Maybe it's not exactly what Rime Labs is trying to accomplish, or maybe it is, but more so: what really cool research should people be looking into and following? Lily Clifford (50:01) Look at the research of the Google Tacotron team, and everyone on that team, by the way, if they see this, you should know I'm a huge fan of your work. The people on that team are just so creative and smart, pushing the envelope on what it means to have enterprise-quality speech models that are really, truly unique. So RJ Skerry-Ryan, Julian Salazar, they know who I am and I just want to shout them out. I think they're doing really awesome work. Shout out to Andrew, a member of technical staff at OpenAI building multimodal voice. It's not easy and you guys are doing great. The speech modeling world is very small, so shout out to those guys. Shout out to Shivam Mehta, who's now at Netflix. These are all good friends of mine. Hermes Frangoudis (50:39) That's awesome. Shout it out. I love to hear it, because it's important. It's such a small industry that you've got to promote each other and help each other. We're getting to the top of the hour. I'm enjoying the hell out of this conversation, but you're also the CEO of a company and probably have a lot better things to be doing than talking to me. So I've got one last question, if that's all right. This one's a bit of a wild card. Usually I ask what you would be doing outside of Rime and the stuff that your team does, but I'm going to switch it up a little on this one. What's your linguistic pet peeve that you notice in AI voice systems that no one else either seems to catch or is just kind of cool with? Lily Clifford (51:17) Here's one that no one's cool with. In English, at least in American English, the American English that most of us speak, right? And it's increasingly true of international varieties of English too. There are many sentences that we would put a period at the end of that people deliver with a rising intonation. And there are many sentences that we put a question mark at the end of where there's a falling intonation. And this is, by the way, a really big problem in customer experience, because if you're asking a question, you want it to sound like a question. If you're saying a statement, you want it to sound like a statement. And it's a really weird intersection, because other people's pet peeve is the fact that people do this at all. They call it uptalk. Uptalk, we all do it. But then the problem becomes, when you're training a model, you really want to control whether or not it sounds like a question. Every user of text-to-speech has noticed this: I put a question mark, but it doesn't sound like a question. And it's like, well, actually, most people would say that sentence as if it wasn't a question. So that's the stuff I think everyone notices. And it's a weird one, because Lily Clifford (52:11) it's people's real-life pet peeve that people do uptalk. It used to be that we thought it sounded like Valley Girl. Now we just know that people do it. That to me is a pet peeve too, as someone who's building these models. Because I do want people to have control.
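The control problem Lily is pointing at is that punctuation is a lossy proxy for intonation, so a builder wants an explicit contour control that can override it. A minimal sketch of that idea follows; the synthesize() signature, the final_contour parameter, and the rising flag are all hypothetical, not Rime's or any provider's real API.

```python
def speak(tts, text, rising=None):
    """Synthesize `text`, optionally overriding what the punctuation implies."""
    if rising is None:
        # The lossy heuristic: '?' means rising pitch. This is exactly what
        # fails, since many real questions are spoken with falling intonation
        # and many statements are spoken with uptalk.
        rising = text.strip().endswith("?")
    return tts.synthesize(text, final_contour="rise" if rising else "fall")

# A question delivered as a statement, as people often actually say it:
#   speak(tts, "You're calling about your order?", rising=False)
# Uptalk on a statement, if that's what works for the use case:
#   speak(tts, "Your total comes to twelve dollars", rising=True)
```

Separating the contour control from the punctuation is what lets the model match how people actually talk while still giving builders the question/statement control Lily says customers keep asking for.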
So it's this weird tension: it's what real people do, but if you're building a voice experience, you don't want that to be the case. It's really weird. That's my tension, often. Hermes Frangoudis (52:33) It's an interesting one. But I also see what you mean: you put in the question mark and you're like, it didn't just magically create some inflection at the end of this one word. It's the whole sentence, right? Lily Clifford (52:42) The question mark is meaningless, by the way. Text, by the way, is meaningless. We've only been writing for 10,000 years. And so we have this really lossy system that everyone knows, that everyone uses every single day in literate societies. And by the way, it kind of sucks in many ways. Hermes Frangoudis (52:49) Interesting. Yeah. Oh yeah, I'm a big fan of the voice-in, voice-out, even if it's cascading. 100%, yeah. Lily Clifford (53:04) Oh, 100%. Yeah. Keyboards are dead, you know, we just don't know it yet. So I look forward to our weird technological society that's at the same time post-literate. I think that is very much within the realm of possibility. Hermes Frangoudis (53:18) I watch my kids, how they interact with a cell phone and even AI. Like, I built an agent just for them with guardrails, and they love to talk to this thing. And it's really cool to watch that. Like you said, they don't even care about the text or the words. It's more how it interacts and how it engages with them and how it helps them accomplish their goals, whether it's asking a question or a story. All right, Lily, I really appreciate your time. Thank you so much for this amazing conversation. For everyone listening along, thank you so much for your time. And if you're following on socials, do that social thing: like, retweet, subscribe, follow. Lily Clifford (53:54) Like, subscribe, hit me up. Yeah. Hermes Frangoudis (53:59) Yeah, message us, let us know. Check out Rime, probably one of the hottest TTS providers right now. And we'll see you on the next one. I love it. Thanks, Lily. Bye. Lily Clifford (54:07) Check out Agora, the developer platform for building conversational AI. Awesome. All right, thanks, Hermes.