Hermes Frangoudis (00:07) Welcome to the ConvoAI World podcast, where we interview builders, developers, and the teams that are really pushing solutions in the conversational and voice AI world forward. And today I'm so lucky to be joined by Andrew Seagraves, VP of Research at Deepgram. Thank you so much for joining me today. Andrew Seagraves (00:29) Thank you for having me. I'm also extremely excited to be here. Great to be chatting with you. Hermes Frangoudis (00:33) The other day when we met, we had a really great conversation. So I've been looking forward to this all week, basically. For anyone that's not aware, Deepgram is one of the leaders in speech-to-text recognition, and most likely you are using them and might not even know it. So welcome, and let's kind of get into it. Andrew Seagraves (00:52) Okay, sounds good. Hermes Frangoudis (00:54) So what motivated Deepgram to build a speech-to-text engine, kind of from the ground up, using that end-to-end deep learning? Andrew Seagraves (01:01) Well, that's a great question. That takes us back to the early days of the company, when it was founded and they chose to focus on that early on. That is not actually the first problem that they tackled. So the founders were originally dark matter physicists and they were doing a lot with audio. They were using audio signals and shooting them into the earth and then measuring what came back, and then trying to use machine learning to understand whether or not there was dark matter present. And so they had some very strong expertise that they were building in applying deep learning to their particular application. This is back in the 2015 timeframe. They were also doing weird stuff like recording themselves Hermes Frangoudis (01:27) That's... Andrew Seagraves (01:42) for weeks, like all of their audio, wherever they were. They attached mics to their clothing and they were recording themselves. And so they amassed this very large volume of audio from recording their everyday lives. And they were looking at the machine learning stuff that they were doing, and looking at this large corpus that they would never be able to listen to, to find the interesting tidbits. And they decided to try and tackle the problem of using machine learning to search the audio, just as a side project. And that was how Deepgram started. They went and founded a company. They built a deep search algorithm. In doing so, they actually indexed all of YouTube at the time, and you could find random audio clips in a YouTube-scale corpus. They demoed on stage at GTC with Jensen. This was very early days. But they realized at that time that search was not a hot thing, and there wasn't a big market for it. Hermes Frangoudis (02:24) Okay. Andrew Seagraves (02:31) So speech recognition at that time was an emerging green field, and there were very few players and all the models were terrible. And so they had this strong conviction from the beginning that if you combined a system that's learned end-to-end, a single network, with lots and lots of high-quality data, you could build a model that could transcribe potentially any human in any situation. So it was just one of the early convictions that they had. They went about building an early prototype in like the 2016 timeframe. Hermes Frangoudis (03:03) That's super interesting. Gotta love that gap in the market, right?
Like everything else is terrible and you're like, actually, I think if we do this, it could be the solution, right? Andrew Seagraves (03:13) Yep. They also kind of got lucky in that there were some early AI adopters from call centers, basically AI platforms that had many call centers as customers. They had these huge volumes of call center audio that they wanted to transcribe and do analytics on. And that particular domain is narrow enough that the very early deep learning models that we had actually worked. And you could train on these narrow domains and produce models that were like 80, 90% accurate. Hermes Frangoudis (03:36) Okay. Andrew Seagraves (03:42) If you just specialized the data. And that was some of the early magic of Deepgram models, that they worked for particular applications where there was an interest and a lot of data and people who were willing to try to use the models. So that was one of the reasons that we built for scale early on too. Hermes Frangoudis (03:58) That's super interesting. It was kind of like the approach kind of led them down the path, right? Like, hey, we're here. We have this stuff. We can apply it here. And now because call centers, like you said, also have that very rigid, structured type of data, they could apply it there and kind of scale. That's super cool. What are the most common misconceptions people might have about speech recognition? Andrew Seagraves (04:24) I would say the biggest one is that it is a solved problem. I think even Jensen has said that in his most recent GTC keynote, he said speech recognition is a solved problem. And I think that it is definitely not a solved problem. It's only "solved" in some very narrow situations. So it works well in situations where we have a ton of data. So there's particular use cases, like call center audio, for example, in English. We have a lot of data. We can train models at scale. And this paradigm of a large, expressive deep learning model trained on lots of data, it works well. And then we've, over time, collected enough data across a lot of niche domains, or what would have been considered more niche domains a few years ago, in English. And so the models have gotten really strong in English. But I think that in non-English languages, the models are in general still pretty terrible across the board. And it's just because of the lack of data. And then I would say beyond just having data to cover the very broad range of speakers and acoustic conditions that you're trying to model, the other big challenge that is not solved is being able to recognize rare and localized words, like words that are specific, say, to a particular customer or a particular person. Like how their name is spelled, for example. Words will continue to be a challenge moving forward, like getting the words right. It's one of the core challenges that's sort of underappreciated, I would say, in speech-to-text. Hermes Frangoudis (05:49) I feel like it leaves a lot of room to grow, right? Like it's still very early days on this sort of thing. That challenge is kind of what you guys have solved on the English side of it. And I'm sure you're going after international and colloquial, localized dialects and stuff like that. In terms of how you Andrew Seagraves (06:10) Mm-hmm. Yep. Hermes Frangoudis (06:13) take the approach for speech-to-text. Really, how does Deepgram balance this real-time latency with the accuracy of the transcription?
Andrew Seagraves (06:25) That's a great question. So there's two major use cases for speech-to-text. And in your question, you're sort of honing in on real time, where you want the model to be actually transcribing the words as soon as possible after they're uttered by the person. And that is just a very hard challenge. The other major use case is batch or asynchronous transcription. And that's where you have, you know, your 10,000 calls for the day that were recorded at your call center. And you want to be able to transcribe all of them in maybe 30 minutes or an hour, like as quickly as possible, so that you can get the insights about what happened today. And so I would say, you know, for both of those kinds of problems, Deepgram has always worked in a constrained design space where we are trying to achieve the maximum accuracy subject to some very concrete performance or engineering requirements: it must be fast and scalable. And when you operate in this constrained design space, there are many approaches that do not work well that would otherwise look great in papers. You know, models that don't scale. Also, very, very large models become impractical if you're trying to hit a latency target. And so it's pretty simple. Small models, a constrained design space, and then actually imposing those requirements from the beginning so that you've designed with scale and speed in mind. Hermes Frangoudis (07:53) So really having that kind of from the start, not taking things that, yeah, they work in batch, because in batch the timing doesn't matter as much. Andrew Seagraves (08:01) I would say, so I'll go one level deeper there, double click, if you will. The data actually impacts being able to solve that problem. So the more high-quality data that you have, and the more localized it is for whatever particular domain you're trying to model, the smaller you can make the model for a given level of accuracy. And so there's a joint dependence there, if you want to think about it. The better the data you have, the more efficient a model you can use. And so we can leverage the data advantages that we have. We have collected and labeled a lot of high-quality data to actually make the model smaller. Hermes Frangoudis (08:39) So I love that we're talking about data, because data is really at the core of deep learning, and that high-quality data is kind of like the grail of deep learning. And like you said, the better the quality of the data, the smaller the model, the more accurate it can be. So can you walk us through a little bit of that model training process, like the data sources, pre-processing, that sort of architecture? Andrew Seagraves (09:06) Yeah. I'll start with the big picture. How do you train a state-of-the-art speech recognition model today? I think the simplest way to describe it is that it's a two-stage process. And this is typical across many deep learning models. You have a pre-training stage and a post-training stage. So there's parallels to this in LLM training, the way the LLMs are produced. It's kind of the same thing in speech recognition. So in the first stage, you're trying to train with a very large scale of data. As much data as you can get your hands on, covering as many voices as you can get, as many acoustic conditions as you can get, and then as many examples of words being spoken.
So this is something that is maybe underappreciated: you want the model to have a very broad exposure to speakers and audio conditions. But when you do that, you also increase the frequency of particular words relative to others. So there's this interesting scale effect that when you scale up the data, the frequency of stop words, the most frequent ones, like the ANDs and THEs, explodes. And then you have an emergence of a very long tail of rare words that appear. So basically, the best you can do is just keep scaling the data as much as possible in the first stage, and you train a model on that. So in that case, you're training primarily on data that's crawled from the web. And then you're filtering that data to the best of your ability to isolate audio that has human transcripts, where the human transcripts are good, basically. So that's the name of the game for the first stage. And you get a model that is pretty good, I would say. And that is how, for example, Whisper was produced. Hermes Frangoudis (10:48) Okay. Andrew Seagraves (10:48) Whisper is like the first stage of production-grade speech-to-text training. In the second stage, you specialize the model in post-training, and you train it on a much, much smaller, more narrowly distributed corpus that covers just the domains that you care about. Yeah. You focus the training, and in that case, the data is very high quality. It must be basically gold ground truth, labeled by humans that are following a very prescriptive style guide so that the labels are consistent. In the first stage, you have labels that have been generated by millions of different humans with no consistency in style. And so then you have to unlearn that in the second stage. And the model's output becomes Hermes Frangoudis (11:32) Forget everything I taught you. Andrew Seagraves (11:35) consistent. Yeah. And so that's basically how it goes. The magic really happens in the second stage, although the first stage is also important. So one thing I'll say, and this is something we've observed that we haven't published, I would say, is that as you scale the corpus in the first stage, there will be a set of words that you've seen, let's say, 10,000 times or 100,000 times. And for those words, you sort of saturate the model's ability to predict them, in that if you were to show the model more examples of those words, it wouldn't help. And then you have this long tail of words that you've seen less than that threshold, say 10,000 to 100,000 times. And word error rate depends just directly on how many times you have seen those words. Yep. Hermes Frangoudis (12:16) It's pretty crazy. So like, Whisper, essentially, the thing that everyone claims to kind of be building off of, is just like part one of the puzzle. So if you're not applying part two, you're not gonna get the right performance that you should be getting out of this. Andrew Seagraves (12:45) Yep, that's right. And you get all kinds of what we call model pathologies that result from part one. The model will insert words that aren't there. The model will omit words that are actually present in the audio, what we call shyness. Hermes Frangoudis (12:59) I've seen this, and I've had people tell me, no, that doesn't happen, it never happens. And it's like, but it totally happens.
And you get one thing in the audio and then a different thing in the transcript, and you're like, they don't match. Andrew Seagraves (13:04) It totally happens. Yeah, and customers react to those two failure modes differently. When the model's inserting words that are not there, it can be very creative. And so you sort of never know what it's going to say. And so that leads to a lot of failure modes, depending on what the model chooses to say. But then the silence is just universally despised by all customers. Hermes Frangoudis (13:21) Oh boy. Andrew Seagraves (13:30) The model should produce something when words are happening. And that's a big one. Yeah. Hermes Frangoudis (13:35) That's super interesting to hear. I feel like sometimes missing a word is also a little bit more forgiving, because you're like, all right, it just kind of missed the word. It's a lot more jarring when there's words added to it. Andrew Seagraves (13:49) Yeah, that is more jarring, especially if you're seeing it in real time. It is kind of interesting that real-time speech recognition is just more challenging than batch, because you're operating on small chunks of audio at a time, in principle, when you really shrink down the buffer of audio that you're sending. And so that's just harder. You have less context. You don't get to see what is coming in the future. In batch, you can see the whole file at once, in principle. Hermes Frangoudis (14:14) So are you constantly building that? As it goes, you're building on the previous buffer that you got? Andrew Seagraves (14:21) Yeah, and there's a tremendous range of different things you can do there from a modeling perspective. But largely speaking, you're maintaining some state about what you've seen so far and updating it as new audio comes in. And then the model may be doing something like deciding whether or not it's going to emit a prediction at this frame or not. And so real-time opens up all kinds of interesting mechanics. But model pathologies are also more prevalent. And then also, real time is the setting where people are watching the transcription live and they see them. So, yeah, for all these reasons, real time is way harder. Hermes Frangoudis (14:55) You don't have to tell us, we've been in that space. Yeah, we're well aware. And it's not just like talking to an LLM, it's talking to a person in real-time voice and video streaming. That's another one of those things where someone's talking, you expect it to happen. And I think that's one of those finely tuned, ingrained things in our human psychology, like we're expecting this to go along with that. Andrew Seagraves (14:57) You're well aware. That's right. Hermes Frangoudis (15:20) So really the biggest challenge in scaling, it sounds like, is not only the part-one accuracy of the data, but not over-saturating. Like you said, certain words are just so prevalent that the more you throw at it, the worse it probably gets, right? Like it doesn't help the situation. Andrew Seagraves (15:38) That's right. The more you scale the data, there's this, like everything in nature, all the underlying physics of audio, of human speech, if you want to think of it in terms of the basic physics, it's all governed by power laws. It's all power law phenomena. And so the words that people utter are described by a power law.
And if I had the ability, I would bring up a plot that you could see right now, or you can go to Wikipedia and look it up: Zipf's law. It's called Zipf's law. Hermes Frangoudis (16:05) Okay, we'll edit it in. Like, what should we put in? What's the graphic we should show right here? Andrew Seagraves (16:09) It's a graphic where, if you take a very, very large corpus of text, and let's say it's spoken text, so from human speech, and then you count up how many times each word appears, and then you sort it on the x-axis, so you have the most frequent words on the left and the least frequent words, the long tail, on the right, you get this curve that is described by a power law relationship between the rank of the word and the frequency. Hermes Frangoudis (16:15) Mm-hmm. Andrew Seagraves (16:36) So you plot it in terms of frequency on the y-axis. And for human speech, the slope of that is like two or something. Hermes Frangoudis (16:45) So it's relatively straight. It's not weirdly asymptotic or anything like that. Okay. Andrew Seagraves (16:50) If you plot it in log space, it is. The thing is that as you increase the size of this corpus, the most frequent words just continue to appear ridiculous numbers of times, and the least frequent words are appearing once. As you go to the scale of data that we actually are training on, the most frequent words are happening seven orders of magnitude more frequently than the least frequent. You have a huge class imbalance in the data, Hermes Frangoudis (17:14) which is very... Andrew Seagraves (17:15) in the parlance of machine learning. You're basically trying to learn these, these are the classes. You have 50,000 or 100,000 classes that you're trying to learn. It's these words. And you have a huge imbalance in them. Hermes Frangoudis (17:25) But let's say you guys have solved this, or not really solved it, but you've gotten to a point where it's good. And we're getting to a point where in real time, like we said, it's tough because you're constantly checking the buffers and updating and deciding. So what are the technical trade-offs between this real-time accuracy and maybe compute efficiency? Because if you lock up the machine, it's not going to do anything, right? So how do you keep from overloading it? Andrew Seagraves (17:54) Yeah. I think there's two ways to answer that question. There's several different cost, speed, and accuracy trade-offs. And I would say the biggest one for real time, the hardest one to design for, is how much forward context you accumulate before you allow the model to make a prediction. And so you can think of that as a buffer, or you can think of it as a right context, you know, future context, that's another way of thinking about it. So you could have no future context: as soon as you get an audio frame, you emit a prediction, right? If you're doing that, then you're very likely to be capturing someone who's mid-word, you know, they're in the middle of speaking a word, and you actually don't know what that word is yet in a lot of circumstances. And so the model is likely to get it wrong if you emit a prediction immediately. If you allow the model to wait and buffer up some audio and then emit a prediction for that previous frame that you already had, it's much more likely to be accurate. So you have this waiting-versus-accuracy trade-off, how much you're willing to wait, and how much you're willing to wait just determines the latency.
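To make the trade-off Andrew is describing concrete, here is a minimal Python sketch: the transcriber refuses to finalize a frame until a fixed amount of future audio has arrived, so the lookahead setting simultaneously controls how much right context the model sees and how much latency is added. The frame size, the decode_fn callback, and the class itself are illustrative assumptions, not Deepgram's streaming API.

```python
from collections import deque

FRAME_MS = 20  # assumed duration of one audio frame


class StreamingTranscriber:
    """Toy streaming wrapper: a frame's words are only emitted once
    `lookahead_ms` of future audio has been buffered behind it."""

    def __init__(self, decode_fn, lookahead_ms=200):
        self.decode_fn = decode_fn                       # hypothetical model call
        self.lookahead_frames = lookahead_ms // FRAME_MS
        self.pending = deque()                           # frames awaiting right context

    def push(self, frame):
        """Feed one new audio frame; return any newly finalized words."""
        self.pending.append(frame)
        words = []
        # Decode a frame only once `lookahead_frames` newer frames exist,
        # so the model always sees some future context before committing.
        while len(self.pending) > self.lookahead_frames:
            target = self.pending.popleft()
            right_context = list(self.pending)[: self.lookahead_frames]
            words.extend(self.decode_fn(target, right_context))
        return words
```

With lookahead_ms=500 the transcript trails the audio by roughly half a second but the model has plenty of right context; shrink it toward 50 to 100 milliseconds and, as discussed next, accuracy starts to suffer.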
Hermes Frangoudis (19:04) So that's like the queue, essentially. Andrew Seagraves (19:05) Yep. Hermes Frangoudis (19:07) I remember, like, in AR and VR, you have the buffer queue for the video frame, right? And then you can render onto it and then present it to the user. And it's gotta be quick before the user realizes there's a crazy disparity between what they're seeing on the screen and reality. And I think it's the same in the speech buffer. You're building this up and then slicing the front. You're like, okay, it's accurate enough, slice it, give them the frame, give them the frame, give them the frame, and constantly doing that battle, right, where the queue is growing and you're cutting. Andrew Seagraves (19:42) Yes, that was a very good usage of the slicing hand motion there. Very nice visual. That's exactly right. And so if you wait a half second, then there's going to be a half second delay between what the person hears and what they see. Hermes Frangoudis (19:47) But in half a second, that's a lot of data. Andrew Seagraves (19:59) That is. All the models do great if you give them a half second. Yeah, you barely even notice it. But if you shrink it down to like 200 milliseconds, then you start to notice it. And then if you go to 50 or 100 milliseconds, there are real limits there, where you must wait a little bit. Hermes Frangoudis (20:19) There's only so thin you can cut that buffer before it just becomes not good anymore. Andrew Seagraves (20:28) Agreed, unless you solve the problem of predicting what that person is going to say, you know, which it does feel like we're maybe heading in that direction. You can model what's in their brain, you know. Hermes Frangoudis (20:36) Who knows? I mean, if you look at what happened with Genie, right? In the video prediction, the user makes a right turn, the user makes a left turn, the user looks up, the user looks down. It's there. So it kind of almost has to forward predict that sort of thing. And I wouldn't be surprised if that comes to voice and to everything, that concept of, okay, we have just this little bit, and based on all this other knowledge, can we throw something else in there to see if it works or not? But I feel like that would introduce new trade-offs in the latency, right? Because you're trying to, I know AI and the layers are very fast in their response because of the weight systems, but you're still trying things at that point to get the weights all right before it'll respond. I feel like that's a whole different battle. Andrew Seagraves (21:32) I agree. It seems like there's this emerging new field of real-time AI where you want to be doing these fancy things like you're talking about, predicting multiple potential realizations of what the person might be about to say, and then ranking them and picking what you think is the best one, that kind of thing. And that introduces just fundamentally new model mechanics, where you have separate models working asynchronously and then sort of checking in with each other here and there. It's probably described by some other kind of computer science operation, it's more like a distributed system.
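A purely speculative sketch of the "multiple realizations, then rank" idea being discussed here, with hypothetical stand-in functions for the candidate predictors and the scorer; nothing below reflects an existing system.

```python
import asyncio


async def propose(candidate_id: str, partial_transcript: str) -> str:
    """One of several hypothetical models guessing how the utterance continues."""
    await asyncio.sleep(0.01)          # stand-in for inference latency
    return f"{partial_transcript} ... [continuation {candidate_id}]"


def rank(candidates: list[str]) -> str:
    """A separate scorer picks the most plausible realization (placeholder rule)."""
    return max(candidates, key=len)


async def speculative_step(partial_transcript: str) -> str:
    # The candidate generators run asynchronously and "check in" at the gather.
    candidates = await asyncio.gather(
        *(propose(cid, partial_transcript) for cid in ("a", "b", "c"))
    )
    return rank(list(candidates))


print(asyncio.run(speculative_step("I'd like to order a")))
```

The asyncio.gather call is the "checking in with each other" moment: several predictors run concurrently and a separate ranker decides which hypothesis to surface.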
And maybe there's a way, like transformers, you can sort of view them as nodes communicating with each other. And there's some really cool, like, Andrej Karpathy has some really cool description, paints a beautiful picture of them in that way. And maybe there are generalizations of that idea, of models that operate on graphs that allow communication across these different streams. And the hope is the whole thing could be trainable. Hermes Frangoudis (22:43) Getting back on topic though, because I know that's future thinking. We're not quite there yet. Andrew Seagraves (22:45) This is sort of almost the nature of every research meeting that we have, is that we started with some topic and some question and then we're like, we totally have forgotten what that is, you know? And then we've gone 15 minutes over by the end, yeah. Hermes Frangoudis (22:54) Sorry, that means that it was a good conversation. But let's get back to Deepgram and your approach. So your approach to deep learning really differs from competitors, right? Like most of them aren't doing what you're doing. And how does it differ? Andrew Seagraves (23:03) Mm-hmm. I know concretely a few ways in which it is different. I would say the biggest difference between what we've done and what competitors have done, I mean, there's probably two things. One is that we will not use giant, like we just won't do giant models unless they're fast enough. And so we've worked in a design space where the models generally have to be small. And then we've really focused on the data, I think. Like getting the data right is really the biggest competitive advantage and secret sauce that Deepgram has. Now, we definitely have made discoveries about interesting architectures that work well in speech, but I would also say that we sort of rode this wave of the large sequence model. Where, like, most very powerful foundational models across different disciplines, different domains in AI, are based on powerful sequence modeling frameworks like transformers. You take a transformer, you train it on a bunch of data, and generally it trains robustly and you get a model that's good at the end of the day. And this paradigm has been true. It's been applied across all domains at this point and it has continued to work. So we definitely were sort of early on that, I would say. We had a model that was kind of like Whisper in late 2021. And that was terrifying, because it was an encoder-decoder transformer-like model that had the ability to sort of say anything. The way the model works is that it listens to the audio and then says what it thinks it hears. That's sort of one way of thinking about it. The encoder gives you some rich representation of the audio. The decoder is basically just a language model that predicts one token at a time, and it kind of looks at the audio and gives you a prediction. Hermes Frangoudis (25:01) Sounds like you're playing the record backwards. Andrew Seagraves (25:05) And, anyway, it was scary to put one of those models into production because Hermes Frangoudis (25:10) Would it just say random words or how?
Andrew Seagraves (25:12) Well, like we talked about earlier, the model sometimes will say things that are not in the audio, you know. If the training data is noisy and there is in some cases a lack of correspondence between the audio and the transcript that you have, you get more frequent model pathologies. The models have degrees of freedom to do weird stuff. Whereas the previous generations of speech recognition models, which were all encoder-based and used a mechanism called CTC, those models are very constrained. They don't have the freedom to say whatever they want, you know. Hermes Frangoudis (25:44) It brought up an interesting topic around the early days of just being early into this space and seeing how it matured, and kind of riding the waves. Like you said, you kind of built on this very stable foundation, which allowed you to do something that was at the time probably revolutionary and even now still industry leading. So in this time, what verticals have you guys seen the most adoption across? Is it call centers, voice assistants, or something else? Andrew Seagraves (26:14) Yeah, I would say most verticals are becoming strong adopters, essentially anywhere where you have a decent volume of audio data available. And that's now becoming true across most verticals. It's just that deep learning models that work reasonably well have been out long enough that people have been willing to use them now. And that leads to vehicles to collect data and label it, you know, over a certain period of time. And so, you know, if you name the top 5 or 10 domains that you could think of: finance, earnings calls, meetings, anything to do with call centers, food ordering, which was historically a very challenging one. The models now work really well in these scenarios where the person is outside, or say in the field, and you transcribe them. Even across a lot of those, it works. So I would say, across most verticals. Hermes Frangoudis (27:14) That's huge. So it's kind of like, everyone needs this. We're seeing just that explosion of how AI is really helping everyone at this point. Andrew Seagraves (27:23) Yep. And then it becomes, if the model becomes robust, and this is not something you asked about, but I'm gonna say it anyway. The model becomes robust to that acoustic environment and is capable of sort of transcribing anyone who speaks in that environment. But getting the words that are rare and localized to that domain, those are the thing. So you can think about medical as a great example of that. And then all of the subverticals that are inside of medical, or subdomains. All the medical specialties where doctors are gonna be transcribing, say, their clinical notes, and saying all kinds of weird drug names. And the model has to be capable of transcribing all of them. So the model may be really good in terms of the acoustics of transcribing doctors in rooms, because that's a relatively straightforward, frequent problem. There's only so many rooms, a limited range of room sizes out there, like doctor's offices. But the words are the thing that's hard. Hermes Frangoudis (28:19) So it all comes right back to that data, right? Like that initial data set, probably the labeling of it, like how clean is that labeling? How well does it...
Andrew Seagraves (28:21) Yep. Even just knowing what the important words are, I think, becomes a hard frontier problem. Like, what are all the medical terms that the model should know for a particular domain or a particular customer? The customer may or may not have a good idea of that, to be honest. Hermes Frangoudis (28:40) And they might mispronounce the word and have their own way of pronouncing the word, right? Because everyone kind of does that. You see it in weird, random ways. Andrew Seagraves (28:51) That is so true. And it almost makes you laugh, because you have use cases where, and this is a very common one, a customer is calling in to their pharmacy and the customer doesn't know how to say the drug name and the pharmacist doesn't know how to say the drug name, potentially. Like, no one knows how to say it. So there's this incredibly broad distribution of possible pronunciations of the word, and no one even knows what the correct one is. Like, how do you discover that information, you know? Hermes Frangoudis (29:20) But as a human, you're able to be like, well, I've seen this word that could kind of look like that thing, right? But as an AI, that's a different story. It's not hallucinating the mispronunciations unless it has that data point to kind of build upon, right? Andrew Seagraves (29:36) Exactly. The models these days, that's a great visual that you just painted there. It's sort of like, if they've seen the word in training some number of times, they'll be able to predict that word. They will be able to produce it in their output. But if they've never seen the word before, they don't know how to spell it. And they might produce some mangled-looking, acoustically right version of it. And so just being able to spell, it's a crucial capability for these models. Hermes Frangoudis (30:04) Becoming a spelling bee champion. Andrew Seagraves (30:06) Really, what the speech recognition models are doing is spelling. You want them to be spelling bee champs, or at least we aim for them to be. Yeah. Hermes Frangoudis (30:12) Cause it's like, give it the word, use it in context, type it out. Andrew Seagraves (30:16) That's right. And then the LLMs now, the LLMs are so magical that they are really the spelling bee champs, I would say. Like, they know how the word should be spelled in many cases. And they can see the mangled version of it from speech-to-text and go, actually, you probably meant this. And you can edit the transcript and sort of insert the right word. Yeah. Hermes Frangoudis (30:40) Super interesting. It's like it takes that concept of LLMs being auto-correct or auto-complete, you know, on steroids, and kind of applies it here too, in a way that actually helps the end user in the end experience. Like it's okay if the speech-to-text kind of mangles certain words that it doesn't know, because hopefully the LLM will be able to kind of reinforce and fix some of those errors. Andrew Seagraves (31:04) Yep. And I'll just mention one thing on that specific topic. It is interesting that that idea is now sort of coming back around again, because that is how we originally, within Deepgram, started experimenting with transformer architectures. So in 2020, there were a handful of companies out there training what would now be considered small to medium scale LLMs.
100 million parameters, scaling up to a billion parameters, that was as big as they were back then. And so you had some early models that were open sourced that were language models. And we did some early experiments where we were taking our encoder speech-to-text model and bolting the pre-trained LLM on. And then we had this Frankenstein model that we then tried to train. And that led to some very interesting findings initially. But the intuition there was that the LLM would be an error corrector. That was why we first tried that. And it worked OK in that setting, when you were explicitly using it for that purpose. But the LLMs were also too inventive. So they would often make changes that they shouldn't be making, you know, and there were positive interventions and negative ones and they kind of canceled each other out. So anyway, those hybrid speech-to-text auto-corrector models never really, they never got shipped, but then they led, Hermes Frangoudis (32:24) They just didn't make it. Andrew Seagraves (32:27) Yeah, they did lead to, you know, sort of discoveries about just training a big transformer on a bunch of data, at least in our company. Hermes Frangoudis (32:34) That's huge. Kind of shifting gears a little bit. We talked a lot about not just specialized speech, but earlier you brought up multilingual, accented speech, that sort of thing, live environments. How are you guys really approaching that sort of thing? Is it, we've got to collect more data and just hammer in on the data? Andrew Seagraves (32:38) Mm-hmm. Hermes Frangoudis (33:00) Or are there other ways that you can supplement? Andrew Seagraves (33:03) Yeah, there's other ways. I mean, so that's a really interesting one. Historically, we have collected data. We have our production system, and there's data streams coming into it. And for Deepgram, for a long time, we've worked on active learning, where we try to identify, amongst this very broad, extremely diverse corpus coming into our production system, what are the interesting data streams that we should sort of be sampling from. And right now, a lot of those would be, you know, non-English data streams where we have data scarcity. And now there's a lot of people using models for non-English. And so you could definitely apply the same approach that we've taken, which is, you label a very small subset of the data that's coming in, the rare, interesting stuff. And you basically use humans to do that, which is slow and expensive. And so there's different ways that you can use humans. You can use them for different purposes. And so what we're doing now, instead of that, given that generative modeling has become more and more powerful, we're attempting to build models that can generate the data that we need, so audio and speech generation models. And this is really an emerging problem that not a lot of people are working on yet, but I would characterize it as learning how to generate complex real-world audio. So that's generating both people speaking, a very broad, diverse speaker base, and then also the complex audio conditions that they're in. So reproducing the noise and echoes. Maybe they're on the street talking into their phone, and you have traffic noise. You have other people in the background. They're in a crowded cafe.
You can imagine all kinds of different scenarios. And so this is a frontier problem that we're trying to solve. We think that this is going to be the fastest path to massively improve non-English, especially low-resource languages, where we're just simply not going to have Hermes Frangoudis (34:37) Reproducing the noise and the echoes. Andrew Seagraves (35:05) the data scale, and it would take way too long to be able to accumulate the scale that we need. We think we can just generate it. And so when you change the paradigm, now it's not, let's just use humans to brute-force label the data. It's, let's use humans in other ways, you know, if we have a sort of finite human-effort budget that we can apply. And it becomes more about identifying what are the interesting data streams that we're trying to generate, and then helping to build the auxiliary models that are needed to do clustering and embedding and all kinds of things like that in support of synthetic data. Hermes Frangoudis (35:42) That's huge. So it's kind of really shifting the approach. But also, I feel like, it's in line with how the industry is shifting its approach in general. Like there's only, like you said, finite human resources, but I feel like there's finite human data. And for AI that's constantly hungry for new data, it's something that you have to solve for. Cause like, what happens when you've scraped all the internet, right? The internet is growing, but not fast enough. The data sets are growing, but nowhere near fast enough for this thing to keep up. Andrew Seagraves (36:14) That's exactly right. And that's one of the reasons, you can kind of see this in language modeling, where, because we've sort of hit limits on the scaling approach, maybe in terms of the volume of data that you would need to make the next incremental improvement in the model in pre-training, now focus has shifted more towards post-training and using reinforcement learning to teach the model new capabilities in post-training, and then leveraging test-time compute, basically. And so the focus has really shifted. And that's just because there's not enough naturally occurring data that can be mined to keep making progress in pre-training. And so now that problem has shifted more to identifying problems where you can use humans or intelligent machines as verifiers in a reinforcement learning setting. I actually do think that there's going to be a similar paradigm shift, probably, in audio. And it will be more or less for the same reason: we're not going to be able to continue to make progress with very, very large scale pre-training, simply because the audio doesn't exist to be able to do that. Hermes Frangoudis (37:17) It becomes a new challenge. But shifting away a little bit from challenges and maybe thinking about the customer. You guys work with so many customers and they do so many really cool things. Can you share a story where Deepgram's tech kind of made a meaningful impact for a customer? Andrew Seagraves (37:35) Yeah, totally. So there's a lot of examples that we could point to. I would point to one category of audio conditions, and then maybe I'll talk about one specific customer. So we touched on this a little bit earlier. Applications where you're trying to transcribe a human who is in the field, out there doing their job, or they're out in the world. And they're not in a studio with a podcast mic. And so there's all kinds of complex information sources around them.
And a great example of this is air traffic control. And a particular example of this would be, you know, NASA, recording and trying to transcribe in real time what the astronauts are saying in their acoustic environment. So a very notable customer that we've had in the last few years is NASA. And we built models specifically for them. And, you know, this approach of taking a general model and training it on a very narrowly distributed data set, even if that data set is complex, if it's narrow enough, the model will work really well. And that recipe applied and worked well. So we have the NASA model. So cool. But there's a huge range of applications where we're going to find similar things. When people are going door to door and they have a mic attached to them, anything like that. Food ordering is another big one. Hermes Frangoudis (38:39) That's so cool. People love to order food and there's only so many people that can answer the phone, right? And take the ticket. And I feel like there was a boom at one point to order everything online. And now it's come back to, actually, let's just call the place, just pick up the phone. And they're not equipped for that kind of volume, people just picking up the phone and calling them the same way they were hitting them online. Andrew Seagraves (39:16) That's true. Yep. That's one where the models are likely to work. They do need some basic capabilities. Like, they need to be able to differentiate between if somebody says number one or number two, you know, the item. Getting the item right is the critical thing there. Hermes Frangoudis (39:31) Yeah, cause that'll just completely ruin someone's day. Andrew Seagraves (39:44) Up until like last year, that was a real problem, you know, when you hear from customers in the food ordering segment. The models sometimes transcribe number one, or sometimes they transcribe number three. And the reason why that might happen, even though those words sound super different, and they do for sure, is that sometimes the audio is so degraded that what the person is saying is what we call indiscernible. So there's a ton of food ordering audio where the customer's audio is just indiscernible. And in that case, the model just hallucinates what it thinks it might be. So it's sort of just randomly picking number one or number three in that case. Hermes Frangoudis (40:18) It's like the person at the drive-through, "Can I take your order?" "Yeah, I'll take a number one." "Number five!" Andrew Seagraves (40:25) Yeah. That's what it would be. Now, yeah, so now you can imagine, in a voice agent context, the speech-to-text has hallucinated the order number, and the agent says back a different number. That is now happening in voice agent systems. Hermes Frangoudis (40:41) Frustrating in a whole new way. Andrew Seagraves (40:44) Yep. And another one that customers really dislike is if the model says the wrong, like, says a menu item but from some other company, because it's been trained on all this different data. And so if it's just sort of guessing at what it's hearing and it hasn't been properly contextualized and scoped, like, this is only customer A, and it produces a menu item from customer B, that's really bad. Hermes Frangoudis (41:11) That's super awkward.
That's like, did they pay you to say that? Andrew Seagraves (41:16) And they do have the freedom to do that. Hermes Frangoudis (41:19) Yeah, because they're, at the end of the day, trying to use their "judgment," right? Based on the weights and things that they've been trained on. So speaking of judgments and weights, what's your view on some of these more open source alternatives? We talked about Whisper. I think Nvidia has Nemo. How do you guys see those? Andrew Seagraves (41:41) Yeah. I think that the quality of any open source model, or any speech recognition model, is a direct reflection of the data that it was trained on. So for an open source model, there are a few data sources in the world where you have scale. And so, you know, there's, for example, podcasts. There are a lot of podcasts in the world where there are transcripts available, or you could potentially produce transcripts using a different speech recognition model. And so a lot of these open source offerings are trained on particular data distributions. And you might expect them to be good on those distributions and not very robust across the board. Hermes Frangoudis (42:24) So they don't work in real world environments. Andrew Seagraves (42:27) Definitely, yeah. I mean, they're going to work, or they may or may not work, in the data distribution that they were trained on. And they're definitely not going to work in the real world, where the data is almost certainly outside of that distribution, you know? And there may be a new way of training speech-to-text models that emerges, where you take a crappy off-the-shelf model that transcribes okay, and you put it into a particular application and figure out a way to update its weights, you know? Hermes Frangoudis (42:36) Yeah. Andrew Seagraves (42:55) Based on how it's performing in that application, like the task that it's associated with trying to perform. Hermes Frangoudis (43:01) So it's almost like tuning the model for what you need to use. Andrew Seagraves (43:06) Yep. Tuning the model. I mean, you can imagine a reinforcement learning kind of approach where you're using this model as part of a pipeline to perform a task. And there are errors in the speech-to-text model that lead to task failures. And then there are times when it actually works, assuming it works at least some proportion of the time, where you get a task success. And if you could figure out a way to structure a reward for the model and sort of backprop that through the speech-to-text model, it is possible that you could conceivably make a model better this way. Hermes Frangoudis (43:35) It's just easier to use Deepgram. Andrew Seagraves (43:38) I mean, why go to all that trouble when there are models that do actually work? Yeah. And I mean, a bigger issue, besides the model just not working, is that the real-time orchestration is very challenging: being able to ingest audio, stream it through the model, produce inference results with very low latency, and return them super fast. You know, that's a whole other thing. It's equally challenging. Hermes Frangoudis (44:07) And that's a little bit more towards the grail of what you want to be focused on, right? Rather than laying all that foundation yourself. Andrew Seagraves (44:14) Yeah, exactly.
That is where a lot of companies are able to have more impact. It's figuring out how to deploy the model close to themselves, maybe on their own hardware, in order to minimize latency for their application. Yep, get the model as close as they can to where the audio is actually being generated. Hermes Frangoudis (44:32) Makes sense. It's like the ultimate edge is on device. Andrew Seagraves (44:37) That's right. Yeah, of course, a lot of the speech recognition models, you can make them small enough and quantize them. You can get them close to being able to run on device. Hermes Frangoudis (44:47) At least the voice activity portions, that can then lead to the larger models that can do the transcription and things like that, right? Andrew Seagraves (44:55) Yep. Yep. That's right. Hermes Frangoudis (44:58) In terms of, like, my next question was going to be around, do you see this becoming commoditized? But I think what we just talked about is like, no, because, yeah, you can run it on edge, you can run it on device, but there's always so much still to be done in the space, right? I think it's not quite to the point where it's going to be commoditized and consolidated. Andrew Seagraves (45:22) Yeah, I agree. I mean, you'd argue that batch transcription is starting to approach the point where it becomes commoditized across a lot of use cases, but real-time, definitely not. No, that's a frontier, for sure. Hermes Frangoudis (45:36) So batch is gonna become like the staple. Like, if you don't offer batch, what are you doing? And then everyone competes on the real-time, which would make sense, because that's the thing that's constantly changing, constantly in demand. Andrew Seagraves (45:47) Yep, that's right. And there are hard batch problems that I would say are unsolved. So the statement that batch is going to become commoditized... Hermes Frangoudis (45:55) Well, but general, like the general task. Yeah. Andrew Seagraves (46:13) Yeah. But there are still hard problems in batch that I would say are unsolved. Like modeling multilingual speech and code-switched speech. So if the person is changing language while they're speaking, that's a hard problem for either real-time or batch. That's definitely not solved yet. And then another one is just being able to model the writing systems in all these different languages and have the model produce formatting that is appropriate for that language. And so the formatting of the output is another core problem that is not really solved, I would say, across languages yet. Hermes Frangoudis (46:29) There's just so much to explore there. I mean, we could go into this whole frontier thinking. I think that's probably a better place to go than really harping on product and strategy and landscape. I'm enjoying our conversation around this. Andrew Seagraves (46:42) Thank you. Hermes Frangoudis (46:49) So in terms of these intelligence systems that are doing the context switching, how do you see that kind of approach? And what do you see as some of the missing breakthroughs that really need to be hit? Because when you think of a natural, bilingual person, right? Like myself, I speak Greek and English.
And when I talk to my parents, when I talk to my siblings, we constantly switch between languages, unless, you know, there's other people around. Because for us, it's almost like our brains pick the shortest word, which word is going to fit there but also get the idea across without having to speak too much, which I could imagine for a speech-to-text would just be the wonkiest thing ever. Cause you're just mashing up the linguistics. Andrew Seagraves (47:32) Yeah, yeah, definitely. I mean, yeah, so we call that problem code switching within Deepgram. And code switching is very hard because, the way the problem shows up in how you just described you and your family code switching, it is extremely localized to individuals and groups of people. And every group of people, the way that they code switch is unique to them. You know, the way they seamlessly switch between the languages and the words that they use. And they might have new words that are sort of a mashup of the two languages that they know, you know, but it's only they that use those words. So it's a very localized phenomenon, and localized phenomena are hard to model. Like, you need data to model it. You need examples of your family code switching to really model it accurately. And so it's sort of a needle-in-a-haystack kind of problem, in that you need to find the examples of the interesting code switching patterns and then learn how to model them. And a model, like a really sound modeling approach. So I would say, as far as how do you formulate the model, I mean, sequence models actually work really well. Like, you can think of an LLM. How do they build LLMs that can speak different languages? You have a multilingual tokenizer that has tokens in it from all the different languages represented. And so it can in principle form any word across any language. So you do the same thing for speech recognition, and you can train a single model on a bunch of languages and it can sort of just naturally learn to emit predictions in English or Spanish or whatever. Or both. And then the question is, if you give it code-switched speech, what is it going to do? And if it's never seen code-switched speech before, it's likely to just be silent, because it's a new data distribution that it's never seen. And this is what the models currently do, you know, when you train them only on monolingual data and then they see code switching. Hermes Frangoudis (49:17) They just get real awkward. They're like that guy that you start talking to in different languages, and he's like, I thought I knew it, but I have no clue. Andrew Seagraves (49:40) You know what? It's really frustrating, because that is the most interesting thing that you're trying to model. Like, that's the phenomenon, the real phenomenon. Hermes Frangoudis (49:55) Don't give up. Don't give up. Andrew Seagraves (49:56) Don't give up, venture a guess. Hermes Frangoudis (49:59) So aside from just multilingual, context switching, that sort of thing, there's a whole other set of detection that these models need to do. And that's around more general things, like distinguishing humans from background noise, maybe who is the main speaker, are there other speakers, to be able to attribute the audio to each person, right?
Andrew Seagraves (50:21) Yes, I would say that is one of the big frontier problems. When we solve that one, there will be tremendous utility to be derived from it, like real audio intelligence. Building a model that is as good at listening as a human is, you know? Listening to another person when they're in a complex environment, and not only understanding what they're saying, but being able to sort of filter out the noise intelligently. And then caring, like understanding not only what that person is saying, but what's the state that they're in, you know? Are they in their baseline, or are they a little upset? Are they a little sad? Hermes Frangoudis (51:03) Emotional context. Huge, right? Because there is no emotional context, and now we see with TTS, they're able to accept emotional cues so they can respond with what seems like an emotional response, or a mimicking response. But if you don't have that data coming in, those things aren't gonna match, right? Like if I'm really upset and you're being sarcastic with me, which happens in the real world, right? Like those are real examples, people just kind of egging on, or leaning into something that's upsetting someone, just to kind of troll them. And I feel like that kind of exists more on the internet than proper behavior, right? Like people trolling, people love to troll on the internet. But that's a huge issue, because if someone's upset, you don't want the model trolling them, you want it consoling them. Andrew Seagraves (51:39) Yeah, totally agree. The model has to be capable of really characterizing its counterparty's emotional state, having some reasonably deep understanding of it, and then being able to respond, if we're talking about a speech-to-speech model, being able to respond appropriately, conditioned on what the model is detecting. And the way that people have attempted to solve building this audio intelligence model so far is that they're taking LLMs as a starting point, because LLMs are smart, LLMs know all kinds of stuff, and they really understand the content of what the person is saying, and then they're trying to graft onto that some kind of acoustic representation or acoustic encoder. Very frequently people take Whisper as an encoder. And then they do some fine-tuning, some post-training on that, to graft Whisper on, or some other audio encoder. And you get a model that does not really understand audio. And there's some interesting work about examining the representations that these models produce internally. So you start to get the sense that this real audio intelligence model really needs to be built sort of from the ground up, and trained on some task or some collection of tasks that includes both audio and text. And discovering what those tasks are, I would say, is the big challenge. Hermes Frangoudis (53:12) Yeah. Cause they only have so much data to go off of, right? And traditionally it's humans, organics, whatever you want to call it. Like when you're talking to someone, usually it's not over the phone so much, so you can see their face. You have other cues that can play into your decision-making. The model doesn't get that. It just gets what it gets, right?
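The "graft an acoustic encoder onto an LLM" recipe Andrew describes above can be sketched roughly as: encode the audio, project it into the LLM's embedding space, and post-train the combination. The module names and interfaces below are assumptions for illustration, not any specific product.

```python
import torch
import torch.nn as nn


class AudioLLM(nn.Module):
    """Schematic only: an acoustic encoder projected into an LLM's embedding space."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int, llm_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder            # e.g. a Whisper-style encoder
        self.project = nn.Linear(audio_dim, llm_dim)  # maps acoustic features to LLM space
        self.llm = llm                                # pretrained language model backbone

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor):
        acoustic = self.audio_encoder(audio)          # (batch, frames, audio_dim)
        audio_embeds = self.project(acoustic)         # (batch, frames, llm_dim)
        # Prepend the projected audio "tokens" to the text prompt; post-training
        # is what adapts the two halves to each other.
        fused = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(fused)  # assumes the LLM consumes embedding sequences directly
```

As Andrew notes, models assembled this way often end up with a fairly shallow acoustic understanding, which is the argument for building an audio-native model from the ground up instead.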
Andrew Seagraves (53:35) Yeah, I mean, that raises a really interesting idea, which is that maybe the way the audio intelligence model will be trained is that it'll be bootstrapped from a video intelligence model. You know, maybe that's one of the fundamental signals that you'll be able to extract knowledge from as a starting point. Yeah.

Hermes Frangoudis (53:52) Right. Like, get this multimodal one that can take the two things, create the right tunings between them, and then distill out from there. It's an interesting idea. Yeah, it's possible. Yeah, I mean, LLMs and AI have proven damn near anything's possible. Things that people five, six years ago told you, no, there's no way you could do that, it'd take too much effort, and an LLM can do it in an afternoon, right?

Andrew Seagraves (54:04) Yep, it's possible.

Hermes Frangoudis (54:22) It's insane. Beyond just emotional context and these really complex things, there are also fundamental issues in conversational dynamics that just haven't been solved, right? Like, how do you really understand when someone stops speaking? That end-of-turn detection is so critical, but also the start of turn, right? Like the start of speech: is it an interjection or is it an affirmation? How are you guys thinking about these sorts of dynamics in the conversation?

Andrew Seagraves (54:55) Yeah, we're thinking a lot about them. So I would say that there are a number of models, a growing number of models, that are now shipping with end-of-turn detection. People are currently trying to solve the problem of understanding when a person has actually finished. Then they're able to apply a heuristic that says, if I think this person has stopped speaking, it's now probably a good time to start speaking. And so then it's like, okay, we allow our voice agent to respond as quickly as it possibly can after we think the person has stopped. But that ultimately will lead to conversational experiences that feel weird. They're not gonna feel natural, because humans are probably doing some end-of-turn detection when they're listening to someone, but they don't always wait until that person is done. They interject. So there's this other signal that you mentioned, which is: should I start now or not? And that signal is much harder to model, I think, because it's more subjective. It's like, is this the right time or not? And that, I would say, is not really characterized yet.

Hermes Frangoudis (56:05) I feel like people don't really get that one right. Like you said, people don't wait, they just interject. People in their own brains are like, idea, idea, I have to share, I have to share, and they just jump in and butt in, like I did just now.

Andrew Seagraves (56:19) Exactly. The butting in seems like it's so situationally dependent, and also dependent on the individual, that it's way more subjective and potentially harder to model. And so you can imagine, if you had a very broad corpus of data where you have humans talking to each other, one thing you might need to do to be able to build the start-of-speaking model is to really characterize all of the people in that corpus as to whether or not they're good at that. You know, who are the people that we actually want to mimic, and who are the ones that we don't want to mimic? And we can use both of those, you know.

Hermes Frangoudis (56:49) Yes.
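The "respond as soon as we think they've stopped" heuristic Andrew describes above can be sketched in a few lines; the thresholds and function below are made-up illustrations, and the point of the sketch is that the separate "should I start speaking now?" signal he raises is exactly what a fixed rule like this cannot express.

```python
# Toy end-of-turn heuristic in the spirit of what's described above: combine a
# model's end-of-turn probability with a silence timer, and let the agent jump
# in as soon as both agree. The thresholds are arbitrary illustrations.

EOT_THRESHOLD = 0.8        # model's confidence that the user's turn is finished
SILENCE_GRACE_MS = 400     # how long we wait after the last detected speech

def should_agent_respond(eot_probability: float, ms_since_last_speech: float) -> bool:
    """Heuristic: speak once the model thinks the turn ended AND silence has held."""
    return eot_probability >= EOT_THRESHOLD and ms_since_last_speech >= SILENCE_GRACE_MS

# What this misses is the second signal discussed above: *should* I start now?
# A human might interject long before eot_probability is high, or politely hold
# back even after it is; that judgment isn't expressible as a fixed threshold.
print(should_agent_respond(0.92, 450))   # True: agent barges in "as fast as it can"
print(should_agent_respond(0.55, 1200))  # False: long pause, but the model isn't sure
```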
I feel like, like you said earlier though, it's situationally dependent, right? 'Cause in some situations it's really taboo to interject, versus others where it's almost expected. Like if you're having a heated debate, the other team is just going to ramble on until you interject to put your points in and kind of derail that part of the conversation. So it's like, how does the model know when it's contextually and socially appropriate?

Andrew Seagraves (57:25) Totally. It's more of an alignment problem, getting that part right. We will be able to train the start-of-speech signal using human conversations, and you'll get a model that exhibits all kinds of different behaviors. It'll be kind of wild, you know, and then you've just got to rein it in and get it to do what you really want in the context. So it is kind of analogous to the LLM alignment stage. I think we'll have that for start of speaking.

Hermes Frangoudis (57:52) I mean, that would be pretty cool to see, right? It'd be interesting to have that. 'Cause right now, let's say you put two models, like voice-to-voice models, in a room together, right? And I'm sure you guys have done this. We do it all the time at Agora. It's a fun way to test your LLM. And they start talking to each other, but they have a very civilized discourse. They go back and forth, almost like a show, right? Like theater. They take their turns. They're very careful. And they don't mimic natural conversation, in the sense that if someone starts talking the other direction, they just give them their space to...

Andrew Seagraves (58:29) Yeah. When you said that, I had an image in my mind of a puppeteer who's doing two characters, and this puppeteer can only do one of the characters at once. That's kind of what a bot-on-bot experience feels like when you listen to it. Yeah. That's right.

Hermes Frangoudis (58:46) And kind of going into that whole style-of-speaking thing, right? I think something we talked about the other day, another major challenge, is adapting to how an individual speaks, right? Like how they linguistically phrase things and speak in a way that's not, like, the normal path, right?

Andrew Seagraves (59:10) Yeah. Yeah. I mean, that's another emerging challenge. I think there are a couple of areas of deep learning where that kind of challenge is addressed, and these are the kinds of techniques that we're going to have to bring to bear. You know, the idea of personalizing a model to a particular person is definitely at play here, and then federated learning is another one. And so, you know, if you're using your phone and you have some model or some system deployed on your phone, federated learning approaches are gonna be able to update the local weights of your model so that it works better for you. And so we'll have to devise clever ways that the person who's using the thing can give feedback, give a feedback signal that says, actually, I meant this, and have that experience be seamless enough that they're willing to continue to use the thing.
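A minimal sketch of the on-device personalization loop Andrew gestures at here: keep a tiny trainable adapter next to a frozen base model, treat each "actually, I meant this" correction as a supervised example, and take a local gradient step; in a federated setup only the adapter update, not the user's audio, would leave the device. The adapter shape, the loss, and the helper names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BiasAdapter(nn.Module):
    """Tiny per-user adapter: a learned offset on top of a frozen model's logits."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, base_logits: torch.Tensor) -> torch.Tensor:
        return base_logits + self.bias

def local_update(adapter: BiasAdapter, base_logits: torch.Tensor,
                 corrected_targets: torch.Tensor, lr: float = 1e-2) -> None:
    """One on-device step from a user correction ("actually, I meant this")."""
    opt = torch.optim.SGD(adapter.parameters(), lr=lr)
    loss = nn.functional.cross_entropy(adapter(base_logits), corrected_targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # In a federated setting, only the updated adapter (or its gradient) would be
    # shared back for aggregation, never the user's raw audio.

# Toy usage: 10 frames over a 50-token vocabulary, with the user's corrected labels.
vocab_size = 50
adapter = BiasAdapter(vocab_size)
base_logits = torch.randn(10, vocab_size)          # stand-in for frozen ASR output
corrected = torch.randint(0, vocab_size, (10,))    # what the user says they meant
local_update(adapter, base_logits, corrected)
```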
Hermes Frangoudis (1:00:05) I feel like that's being very generous to the model and the experience, because most people will just keep going, "but I meant this," and continue to mispronounce it, and then go through this whole roundabout way of describing what they meant. But the TTS isn't gonna get it right. Like, it's not a human.

Andrew Seagraves (1:00:23) No, and the success of this will all hinge on: are you able to update the weights of the model in the moment, once you get that initial feedback, so that you get it right immediately?

Hermes Frangoudis (1:00:34) So like self-adapting, self-adjusting. And is that stuff that you guys are working on, or have in production now? If you can talk about it.

Andrew Seagraves (1:00:43) Yeah, no, it is something that's a core problem that, I would say, not many people have worked on in general, across the board. And it's, you know, one of the core problems of the next phase of Deepgram. Yep. Models that adapt. We've been solving that problem through training up to this point, like training highly adapted models offline, in an offline setting, for customers. We had a lot of early innovations there, like: if you only have an hour of data for a customer, how do you get the most accurate model possible from that hour? You know? We'd ideally like to have 10,000 hours. And so you can do all kinds of things in this low-resource setting, the few-shot setting. It's kind of the same thing, but for the real-time adaptation, some of the mechanics are going to be different. You won't be updating, let's say, the weights of the models. You might be able to update the activations that were produced, be able to edit them in real time. Take a different path...

Hermes Frangoudis (1:01:43) Take a different path down to get there, okay?

Andrew Seagraves (1:01:46) And then you store the signature of that update that you did, and then at some later point you incorporate that into training. These activation-editing approaches exist. There's a handful of people working on this. So it's kind of a newer thing.

Hermes Frangoudis (1:01:59) So it's kind of like a newer thing. Super cool. We have been going for so long, and I want to be mindful of your time, because you are a VP of Research and probably have way more important things to be doing than talking to me. But I appreciate you taking all this time to answer all my crazy questions and go down these rabbit holes with me.

Andrew Seagraves (1:02:20) Yeah, totally. I had a great time chatting with you.

Hermes Frangoudis (1:02:23) We'll definitely have to do this again, 'cause I have so many more questions now, and I'm sure I'll have even more questions the next time we meet. But right now I have one more question, you know, like that Steve Jobs "just one more thing." If you weren't working on STT, what other voice AI domain do you think you'd be betting on next?

Andrew Seagraves (1:02:35) Mm-hmm. Yeah. So I think the problem of modeling emotion is really interesting. And that to me is one of the frontiers where, if I were still building models, you know, and if I had some spare time here and there as a parent, then, Cursor has gotten good enough that in an hour I can spin up a model. I would probably want to be working on emotion recognition, modeling it in an unsupervised way, and trying to utilize this idea that most of the time a human is in a baseline state.
And then every once in a while, they're in an excited state. And I think that, just to say one idea that might work: if you look at text-to-speech models and the way that they're trained. Text-to-speech models are trained the same way speech-to-text models are these days: a very, very large-scale training first, and then a much more refined fine-tuning, a post-training. But if you look at the model after the first stage of training, it has seen millions of speakers in its training, and it doesn't really represent any of them particularly well. So it gives you this averaged expression, averaged human expression. But it's not very emotional, you know, it doesn't exhibit a lot of extremes. And so a TTS model that's pre-trained on a very large corpus of data with weak conditioning kind of represents the baseline state. So you could take that and construct a difference operator that takes some audio of a human, and then takes a TTS simulation of that human, and then compares the embedding of the actual audio to the synthetically generated one

Hermes Frangoudis (1:03:59) Mm-hmm. Mm-hmm.

Andrew Seagraves (1:04:23) in some embedding space. And there will be examples where that human is animated, and the TTS will not be capturing that. So you should be able to detect this difference.

Hermes Frangoudis (1:04:33) So it would be like the differences in the pitch and inflection and all those other characteristics that aren't really there when you see text on a page, right?

Andrew Seagraves (1:04:42) Yep, exactly. Yeah, I mean, this is literally what I would do, which I guess I've now described to you.

Hermes Frangoudis (1:04:48) No, it's cool. We don't have to give away all the secret sauce. I think it's a really awesome approach and concept, and who knows, maybe someone's watching and will let you do this.

Andrew Seagraves (1:04:58) Yeah. So, you know, give it a shot. Let's see what happens. It's likely to not work, for a variety of reasons, you know.

Hermes Frangoudis (1:05:00) Yeah, that was fun. I mean, that's what experimentation is about, right? Is it gonna work? Is it not? There's only one way to find out.

Andrew Seagraves (1:05:11) Exactly.

Hermes Frangoudis (1:05:11) Well, Andrew, I appreciate your time. Thank you so much for all these... I can't even call them tidbits. These are knowledge bombs. You've kind of created these explosions of extra information in my brain that I'm so thankful to have, and a better understanding of all these speech-to-text patterns.

Andrew Seagraves (1:05:33) Well, I'm glad. I had a great time chatting with you. So, yeah, thanks a lot.

Hermes Frangoudis (1:05:38) Well, for everyone that's following along: like, subscribe, and follow along with the podcast. We'll be releasing this episode in the coming weeks, and keep an eye out for when we're going live for the next one. Appreciate everyone's time and see you soon.
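For anyone curious about the shape of the baseline-versus-reality idea Andrew outlines near the end, here is a small, heavily hedged sketch: embed the real utterance, embed a weakly conditioned TTS resynthesis of the same words, and treat the distance between the two as an "animation" signal. The transcription, resynthesis, and embedding steps are left as hypothetical placeholders; only the comparison itself is shown.

```python
import numpy as np

def animation_score(real_embedding: np.ndarray, baseline_embedding: np.ndarray) -> float:
    """Distance between a real utterance and its weakly conditioned TTS baseline.

    The intuition from the conversation: a TTS model pre-trained on millions of
    speakers gives an "averaged", unexpressive rendition of the same words. When
    the real speaker is animated (excited, upset), their embedding should drift
    away from that baseline; when they're near their baseline state, it shouldn't.
    """
    a = real_embedding / np.linalg.norm(real_embedding)
    b = baseline_embedding / np.linalg.norm(baseline_embedding)
    return float(1.0 - a @ b)   # cosine distance: ~0 means it matches the baseline

# Hypothetical pipeline around it (these helpers are placeholders, not real APIs):
#   text     = transcribe(audio)                  # ASR pass over the real audio
#   baseline = tts_resynthesize(text)             # weakly conditioned TTS of same words
#   score    = animation_score(embed(audio), embed(baseline))
# A large score flags the "animated" stretches the averaged TTS can't reproduce.

# Toy check with random vectors standing in for real embeddings:
rng = np.random.default_rng(0)
e_real, e_base = rng.normal(size=256), rng.normal(size=256)
print(round(animation_score(e_real, e_base), 3))   # some nonzero distance
print(round(animation_score(e_base, e_base), 3))   # ~0.0: identical to the baseline
```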