Ricardo Herreros Symons (00:00) The bar that is being set for speech AI is that of what humanity can achieve, not that of an individual human.

Hermes Frangoudis (00:04) That's something that most people just don't think about as being a crucial issue.

Ricardo Herreros Symons (00:09) A quick response is great, but a quick response to the wrong thing? Game over. Quantity is a quality in and of itself. I think humans are also changing how they interact to work with the AI.

Hermes Frangoudis (00:27) Hey everyone, welcome to the Convo AI World podcast, where we interview the companies and leaders pushing the voice AI space forward. I'm really excited today for our guest from Speechmatics. Thank you so much, Ricardo, for joining us. So let's jump right into it. I always like to get a little bit of the background story. So tell me a little bit about how Speechmatics started and how you got involved with the company.

Ricardo Herreros Symons (00:40) It's a pleasure, looking forward to it. Basically, I spent far too long studying, to my detriment, potentially useless things as well: medieval European literature. But by the end of my degree at Cambridge, I was looking for a way to stick around for one more year, because I'd picked up rowing, crew, I think you call it here, and I wanted to compete in the Boat Race against Oxford. So I needed another degree to tack on, and in Cambridge you can do a one-year master's degree in management. As part of the management degree, you do a project for two or three months at the end of it. Most of those projects, you go work for a big bank, you write a report, they look at the report, they throw it in the bin, done. But there was one that was just this company, Speechmatics. It had one employee. Basically myself and a couple of other rowers on the team got involved for a couple of months. We made a sale; again, it was like a $30 kind of thing. The product wasn't in the cloud. Speechmatics was just the trading entity with nobody there. So I got involved with that and then stuck around afterwards, and basically myself and another person from one of the MBAs there, along with Tony, our original founder, took Speechmatics from pretty much zero to where it is today. So it was a university project in that sense.

Hermes Frangoudis (02:03) So you've come a long way with the company. Looking back, did you ever imagine it'd become what it is today?

Ricardo Herreros Symons (02:08) So when I first met Tony, the first thing he said to me, and again, numbers have inflated as they always do over the last few years, but I remember I met a man in a big t-shirt with a big booming voice, and he just kept saying a hundred million. I wasn't sure if it was a hundred million what: dollars, pounds, whatever, revenue. But at that point it was a hundred million, and we were sitting there with nothing. And obviously nine out of ten startups fail. But yeah, I'd say I'm not surprised. There was always an underlying expectation that it would go somewhere. The technology has always been really good. So from that perspective, I'm not surprised.

Hermes Frangoudis (02:42) I agree with you there. We were talking earlier about how the speed of things is changing, and I guess for our viewers and everyone watching, can you maybe explain a little bit about what Speechmatics does in plain English?
Ricardo Herreros Symons (02:54) In plain English, if there is audio out there, we will find a way to make it machine readable. We'll turn it into text. This can be used for live captions. It can be used for uploading media asset management files. It can be used for powering voice agents, medical scribes, meeting notes, basically anywhere there is audio that needs to be interpreted in some way by a machine. We will take that amorphous voice data and turn it into something that is understandable, tokenized, and that you can therefore drive insight from.

Hermes Frangoudis (03:25) So zooming out, there are so many different pieces that voice AI can touch at Speechmatics. What are you most motivated to solve with voice?

Ricardo Herreros Symons (03:35) So the overarching goal is to be able to get the technology into as many parts of people's lives as possible. I always talk about what success looks like for me. When I went into a startup like this, a lot of people would, not mock me, but sort of say, okay, you've chosen a slightly weird path, you keep going on about speech recognition. I was always talking to my phone, always using the Google interface. This is back in 2015, 2016, and everyone was like, can you please just stop being annoying? And so my mission, a little bit, was: you know what, I'm going to make sure that in your everyday lives, whether you're on a support call at work, whether you're watching live sports on TV, whether you're dictating something, underneath the hood, Speechmatics technology is going to be there. So that was the overarching goal. But then from a more, I guess, vocational perspective: my younger brother, Juan Antonio, my half brother, lives in Spain. He's had various learning difficulties throughout his whole life. He's 25 now, and he basically can't read and write. He doesn't speak English, obviously, he lives in Spain, but he also can't really read and write at all. His world opened up when the voice interfaces on telephones enabled him to do little things like search YouTube for the music he likes. He lives in that world now, and he is very much an example of somebody who never really had the capability or the opportunity to be literate, as in he could have spent years in class and in school and just wouldn't have got there. But through speech technology, his life is so much richer for it. And that's just on the input side; you add text-to-speech and you get the conversations. I just think about, across the board, being able to reach those people, taking the interaction with machines to our most basic form of interaction, to speech. And that's really exciting. So that for me is the personal element to it. But the other personal element as well is: let's just get this stuff everywhere, because it makes me proud to see it.

Hermes Frangoudis (05:29) No, it is. In terms of the interface, it just removes a certain barrier to using technology that has been there since the beginning of computers. You always had to be a literate person. I guess early on you had to be someone who understood how to speak to machines in their language. Then it became someone who can click through the interface and type. Now, with voice, everyone who can speak can communicate.

Ricardo Herreros Symons (05:53) Voice and LLMs, right? Because now we're at the next stage.
You still need to be able to code, but that will go away. And now the beauty is that with the power of just plain speech, in pretty much any language, you're going to be able to transform the world, or at least have the opportunity to. So there is a great level of democratization, but also just of speed, right? It doesn't matter how fast you type; it is just more efficient to be able to speak things through. I'm sure maybe once the Neuralink stuff gets there and we're thinking our thoughts into our phones, that will be the next stage. But for this period of time, speech has to be that answer.

Hermes Frangoudis (06:30) We were talking about this earlier: people often assume that speech recognition is kind of this solved problem. What do you have to say to those people?

Ricardo Herreros Symons (06:40) I reckon I've been told speech has been solved pretty much every year since about 2017, 2018. And don't get me wrong, it is solved for more and more use cases, but what was the long tail then becomes the high volume. People's expectations grow; the stakes, as they move into higher-risk, higher-stakes industries, become more important. So there is still a long way to go on it. Sure, take this conversation, right? This conversation is pretty well miked up, not a lot of background audio, and accent-wise there's nothing that challenging. The areas where we will fail here will probably be if I start mumbling, or if I start using some funky vocabulary, things that aren't in the dictionary. The diarization is going to be picked up off each mic separately, so you could do a pretty decent job of this. But this isn't the real world. I mean, this is the real world, but it's a small representation; there are so many different things you could add onto this where the problem still remains unsolved. To be honest, the hard problems are the funnest ones to have, because they show the ability to really differentiate. And of course, even if it were solved in and of itself, much like we're seeing with LLMs, there is still a question around solving things efficiently too. You've got LLMs that think really well, but if they're costing you an absolute load of tokens, then that's not usable every day, or sustainably, although obviously they're making those systems more efficient to deliver the same quality of output. There are so many things still to attack. But I also fully take the point that for a lot of use cases, you can get pretty good with, probably, a model that's sitting on device. That long tail keeps on growing as well.

Hermes Frangoudis (08:18) So it sounds like the low-hanging fruit is a bit more solved, but as you mentioned, when you get into real-world scenarios, background noise, multiple speakers, mumbling, it changes the game. That's not the solved problem.

Ricardo Herreros Symons (08:34) Yeah. And accents and dialects, and then there's this request: people want things to be simple. They want multilingual models. They want models where you don't need to say what language you're speaking in or what the accent of the speaker is; they can just turn it on, press go, and it understands everything. That in and of itself drops the accuracy. If you have a model that runs a hundred languages, it will reduce the accuracy a little bit.
So then how do you raise that to the quality you would have got from a single-language model beforehand? If you want to go use-case specific, you could use a model for a certain use case, or now people are training models with LLMs on the backend where you can access a different section of the overall model itself. Again, there's always going to be something driving the innovation and the need for improvement.

Hermes Frangoudis (09:17) That makes sense. You touched on it earlier: you talked a little bit about transcripts, but it's not just about the speech-to-text as a transcript. It's about extracting more from that, right? Can you tell us a little bit about how Speechmatics is looking at extracting meaning?

Ricardo Herreros Symons (09:33) Yeah, I mean, you start off with the text, and text is beautiful, because obviously all the LLMs have been trained on text. You can do lots with it. It's very easy to put safety nets on it and say what you shouldn't be doing. But beyond it, there's a richness to speech, and a multimodality to speech as well, where there's still a long tail of the way to go. So how is something said? Who said it? What's the affect in it? A lot of these real-time voice engines still don't diarize very effectively, so they don't say which one of us is speaking. I mean, at the moment, if you're having a call with a financial institution and you're using a voice agent, technically, if any other voice is heard in the background, that voice needs to authenticate itself, because otherwise somebody else could be coercing you into doing something. So the ability to know who is speaking is really important. Likewise on the emotional side of things: being able to understand not just, am I happy? Am I sad? Am I frustrated? There's a lot of power in all of that information, but also going deeper into what's the state of mind. You go into the long tail of healthcare possibilities; you could solve for that too. There are so many things, so many paralinguistic pieces of information that sit beyond just what the words are, that will drive a richer conversation. And when we talk on the voice side of things, we're looking at voice activity detection, we're looking at turn detection. Is this the end of an utterance? What's the thought going through here? What type of latency is it? So yeah, I could go on about it forever.

Hermes Frangoudis (10:58) No, I would love to hear about it. It's a very interesting space when you think about it. You mentioned diarization, but in the sense of the real-world scenario, when you're talking about financial systems and, like you said, the user has to authenticate themselves. That's something most people just don't think about as being a crucial issue, because in that moment the voice is picked up and put into the transcript. There's no way to say that wasn't there, that it was ignored, right? If it was just human to human, they would understand that's just some background noise. But because it's an agent model, it would pick it up as if it's the person having the conversation speaking. So what is something that you think humans do effortlessly that speech AI still struggles with?
Ricardo Herreros Symons (11:44) So I think there's a little bit of a paradox here, because actually speech AI does a pretty good job, but the bar that is being set for speech AI is not that of an individual human. The bar that's being set is that of what humanity can achieve. I speak, to different levels of quality, six languages, maybe, at a push. We could do this in them, but it'd be a pretty dull conversation if it was in language number six, don't get me wrong. And if it was in Greek, then I'd probably just be saying hello, good morning, and where's the beach. But we're expecting these models to be able to speak 100 languages, understand every dialect, and take in all this information. So actually we're not expecting the models to be as good as a human. We're expecting the models to be as good as all humans. And that's where they will struggle a bit more. I think dialects, age, accents across the board, there are things that are just a little bit more difficult. But on the flip side, I know people now who, when they are speaking to somebody from another country who perhaps has a very strong accent, will put the speech-to-text on, and the speech-to-text will do a better job of understanding them than they will, because that speech-to-text has been trained on millions of hours of labeled or unlabeled audio and is therefore able to do a better job there. So I think it's more a question of the expectations being higher for the speech technology, because it needs to reach not only what one individual can do, but what everybody can do.

Hermes Frangoudis (13:01) That's very interesting. So the bar is not what one person can do, but what all of humanity can achieve and understand.

Ricardo Herreros Symons (13:08) I think so, because you're not asking this engine to solve just one problem. Obviously you'll use different models, but it needs to match the capabilities of a specialist in lots of different areas. So take our medical models, for example. Our medical models will transcribe any highly medical interaction better than most humans, because most humans haven't heard half of the terms or drugs that get mentioned. Obviously, a specialist radiologist is going to be great at that, and they'll know how to spell them; they'll probably be looking the stuff up online as well. That level of expertise is actually beyond 99% of humans already. It won't be as good as the very, very best humans doing it, but it'll get pretty close. So there's an expectation of specialism in every domain which we put on these models, and fair enough. They've got more time to train.

Hermes Frangoudis (13:59) Well, it makes sense. And it feels like there are also gaps in a lot of these industries where they just don't have the people, right? Because it takes such specialization and years to get there, and the AI can fill that gap for them.

Ricardo Herreros Symons (14:11) Well, yeah, I mean, the beauty of it is the AI can train up so quickly now that, one, you can train the AI to then train the employees and make them experts. I mean, I was having a chat about this; I'm not particularly well versed in the US healthcare system, but I understand that everybody applies for insurance in the same six-week period, in October, November.

Hermes Frangoudis (14:32) It depends on your company, but yes, there are periods where everyone in that group will sign up.
Ricardo Herreros Symons (14:37) Exactly. And so these traditional contact centers are having to hire a ton of seasonal staff to up the number of people able to take the requests about health insurance. A lot of these people will be doing the job for one day, two days, and then not doing it again, or never doing it again. So they just can't possibly be experts. But actually, you've got a highly sophisticated LLM which has been trained on the information. One, that can be used to train the others. And two, it's probably going to get you 90% of the way compared to the people who've just started there. So again, it's an example where this high availability of specialization from the beginning is really able to help everyone get a better experience.

Hermes Frangoudis (15:16) So interesting. I'm sitting here thinking, from a business perspective, it's not about, can we just get someone to fill it? Can we just get an AI? This is an AI that is meant and purpose-built for solving your issues. And you mentioned earlier this idea that the bar is what humanity can understand, right? So it's about understanding every voice. Where did this mission come from? Is that a Speechmatics thing, or is that personal?

Ricardo Herreros Symons (15:46) I think the Speechmatics mission, understand every voice, came from the fact that we saw, in the early days of speech recognition, that in general the first thing to be solved was the traditionally accented American white male voice. You can unpack what that means in many ways, but we'll just use it as a proxy for now. But we spent a lot of time training our system, creating self-supervised and representational models on loads of different audio, so that, in fact, a Hungarian model would have millions of hours of, I don't know, something like Dutch training data in it, because we wanted something that was truly global, something that could understand people across the board, understand people from every corner, whatever the availability of data in general. And so it was just a very natural place for us to start. I think as well, it's probably just that the company has its history in Europe. Europe is a melting pot of languages. I mean, the US is also a melting pot of languages, obviously more recently. But it's just that simple mantra: you can't create this technology only for the people that have traditionally been recorded in the traditional training data. You want to make sure that this is something anyone can pick up and use.

Hermes Frangoudis (17:03) So really diversifying and bringing in things like accents. That's probably an area that's been traditionally a blind spot for, I guess, a lot of speech-to-text companies, because you're saying they train on that stereotypical profile of the people who record themselves.

Ricardo Herreros Symons (17:20) Exactly. I mean, back in the day, if you just trained on all of the news data that was available, people spoke in a certain way, and if you didn't speak that way, then there was going to be less training data for it. But the beauty of it is that you can use unlabeled data as well, of which there is a lot more, to train these models to better understand. I mean, there are so many tricks you can now use to get them to become truly global.
But yeah, the days of two people with very similar accents talking to each other are over in the global economy, and so you need to make sure that everyone has access.

Hermes Frangoudis (17:51) Have you seen customers that are genuinely surprised by what the AI systems will understand? Can you tell us some stories about those surprises?

Ricardo Herreros Symons (18:01) Yeah, I reckon still every week somebody is shocked by what this is capable of. I mean, it is a bit of a trend right now. LLMs have felt like magic; they still feel like magic in terms of what they do. And because speech is the gateway to that magic, the gateway for the lay person to LLM magic, that wow factor is able to be achieved consistently. I think the thing that probably gets the most wows at the moment: I do a little medical rap, where I'll say, right, you say speech is solved for easy use cases, okay, let's make a difficult use case. So I start saying particular medical terminology in a difficult accent and have that understood, recognized as a different speaker as well, and not interrupted because it's a different speaker. That is what really drives wow moments. So if we're talking to an AI agent which is powered by Speechmatics and I say to it, hey, ignore Hermes, right? Just only respond to me; he'll be sitting in the corner there. Then we can have a conversation, and then I can say, by the way, what do you think about what Hermes said? And just from your voice, if I ask a question, it will respond, and if you ask the question, it can ignore you, or vice versa. Being able to use the diarization in a very human way really garners a lot of wow moments, as does the way in which we build our custom dictionaries out. We can rebuild our model at runtime with thousands of extra words that weren't there. People just love, oh, that's my company, that's my name, that wouldn't have appeared beforehand. Things like that still create a lot of wow.

Hermes Frangoudis (19:37) It's the ability to retrain on these words, company names and specialized names. That's super interesting, because I hear this from other people who've done interviews on the podcast. It's that ability to say certain words right without having to use a phonetic library of sorts, because no one uses a phonetic library, right? That's not natural. So the ability to have that wow moment, because the technology has made it so easy to bring it forward, is really cool.

Ricardo Herreros Symons (20:09) Yeah. With most of the systems out there, if you try them, the fix for customizing the dictionary at runtime is more of a prompt, and so it doesn't have a lot of depth to it. You can do it with maybe tens of words, but it really falls off a cliff. Whereas we can add thousands immediately: all your products. Because I assume you use meeting notes; you probably use a Granola or something like that. The challenge is that even internally at Agora, you've probably got, I don't know, 20 different projects with weird names that are half made up. Half of your customers' names won't yet be well recognized, because here in the Valley every business is only about two minutes old. And so the ability to take everything that sits in your CRM and upload it directly into the system, so that whenever you're talking in any internal meeting about any business you work with, you get all of those names transcribed properly, makes a massive difference.
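A minimal sketch of what that runtime custom-dictionary idea can look like from the integrator's side: push thousands of CRM-derived terms into the recognizer per request, rather than biasing with a handful of prompt words. The config fields (`additional_vocab`, `content`, `sounds_like`) and the commented client call are illustrative assumptions in the style of common STT APIs, not any specific vendor's documented schema.

```python
# Sketch: building a per-request custom dictionary from CRM data.
# The config shape and the commented client call are illustrative
# assumptions, not a specific vendor's documented API.

def build_vocab(crm_names: list[str],
                sounds_like: dict[str, list[str]] | None = None) -> list[dict]:
    """Turn CRM entries (customers, projects, products) into vocab entries."""
    sounds_like = sounds_like or {}
    return [
        {"content": name, "sounds_like": sounds_like.get(name, [])}
        for name in dict.fromkeys(crm_names)  # dedupe while keeping order
    ]

config = {
    "language": "en",
    "diarization": "speaker",
    # Thousands of entries are fine here; a prompt-based bias list
    # typically falls off a cliff after a few tens of words.
    "additional_vocab": build_vocab(
        ["Agora", "Speechmatics", "Granola", "Project Kestrel"],
        sounds_like={"Agora": ["uh-GOR-uh"]},
    ),
}

# Hypothetical client call:
# job = client.submit_transcription_job(audio="standup.wav",
#                                       transcription_config=config)
```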
Hermes Frangoudis (21:05) Huge difference. These models are multimodal, taking audio in and getting text out, but it's not just the text, as we said earlier, right? It's more about taking that rich audio and understanding insights from it. So can you tell us a little bit about what kind of insights companies are getting from speech?

Ricardo Herreros Symons (21:25) Yes. I mean, in terms of what they're getting from speech right now, the majority driver is, again, linked to what the LLMs can achieve. It really is: how can you lose as little information from the audio data as possible? So getting beyond just the text, getting who said it, how it was said, the hesitations, all that kind of thing, and then powering whatever use case it might be. And again, this could be anything from ambient scribing to voice agents to big cohort analysis. But realistically, at this point in time, the most exciting thing that speech is doing is giving a seamless interface to the LLM magic.

Hermes Frangoudis (22:01) It becomes almost too frictionless, in a sense.

Ricardo Herreros Symons (22:06) Yeah. I mean, I think there's still a little bit of friction, in the sense that I don't feel like we're quite there on computer use, right? You want great technology, but then you want the great technology to integrate into all the systems in a safe way, and obviously with LLMs there's a real challenge there. So just making sure that the best-in-class systems can integrate into these LLMs effectively, I feel like that's still a little bit away, not far away, but I really want to be seeing everyone being able to essentially voice-run everything, the Jarvis-from-Iron-Man type thing, right? There's no reason it shouldn't happen; the quality of the speech-to-text is good enough. There's obviously an efficiency question as well, running this without it costing absolutely tons, but it is still less than the LLMs. So it is just an implementation question of getting to the point where we are all voice-driving our agents.

Hermes Frangoudis (23:00) What's an example of a speech insight that you'd say has a big impact on business? Something where, let's say, using the text alone wouldn't give me as much information as using the audio data along with the text that comes out. Is it sentiment, intent, feeling?

Ricardo Herreros Symons (23:23) I think there's still a little bit to go on the audio sentiment side of things. Again, everything's moving fast, and I'm sure between us recording this and it going out live, things will have moved on there too. But sentiment does feel like where there is a little more of a way to go. So I'd probably double-click on the diarization side of things; that is one area where it can give you that next level of information. Whether it's anything from authentication to useful summaries, being able to pick up which speaker said what is so key when you're looking at meetings. Especially, look, here we've got different channels, but when you don't have different channels, just being able to say it was definitely this person who said this for the entirety changes the quality of the summary and the insights afterwards, if you can know who said it.
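A minimal sketch of why speaker attribution changes the downstream summary: fold diarized segments into speaker-attributed lines before handing the transcript to an LLM, so the summary can say who committed to what instead of guessing. The segment shape (speaker label, text, start time) is an assumption about what a diarizing engine returns, not any particular vendor's output format.

```python
# Sketch: folding diarized STT output into a speaker-attributed transcript,
# the form a summarizing LLM actually benefits from. The Segment shape is
# an illustrative assumption, not a specific vendor's schema.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # e.g. "S1", "S2" from the diarizer
    text: str
    start: float   # seconds

def attribute(segments: list[Segment]) -> str:
    """Merge consecutive segments from the same speaker into one line."""
    lines: list[str] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if lines and lines[-1].startswith(f"{seg.speaker}:"):
            lines[-1] += " " + seg.text
        else:
            lines.append(f"{seg.speaker}: {seg.text}")
    return "\n".join(lines)

segments = [
    Segment("S1", "Can we push the launch to Friday?", 0.4),
    Segment("S2", "Only if legal signs off first.", 3.1),
    Segment("S1", "Okay, I'll own that action.", 5.8),
]
# Feed this to the LLM instead of a flat transcript, so the summary can
# attribute commitments to the right person.
print(attribute(segments))
```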
Hermes Frangoudis (24:07) That completely makes sense, because when everyone's on a digital voice conference, it's very easy to decipher; you know where the streams are coming from. But once you start mixing real-time people in person with people online, the waters get a lot muddier.

Ricardo Herreros Symons (24:24) Yeah, loads more. That is a challenge that is still really hard to solve, and I don't think there are many solutions out there implementing diarization as well as they

Hermes Frangoudis (24:33) could be. And your team does diarization in real time, right? Yeah. So how much does speed matter when you're dealing with these sorts of real-time conversations?

Ricardo Herreros Symons (24:44) Speed is pretty fundamental. It depends what you're using it for. At the moment, the biggest drag on the speed of an interaction, well, actually you don't want an interaction to be too fast. The worst thing possible is that an agent barges in. You're talking, you take a slight pause, and within a second the agent is already trying to answer. Then you get into this really unnatural back-and-forth: sorry, I'm going, what, me? No. Horrible, disgusting. We tend to feel that end to end, one speaker to another, you probably want 1 to 1.2 seconds between the two. You also want to be using some end-of-thought or end-of-utterance models to understand that too. We provide disfluencies, so the ums and ahs in what we're saying, as well. Just like that; again, I'm pausing a little bit there. I think if you can give more time for the LLM, the TTS, and the other models to do the thinking, that is valuable. There is always a requirement for the speech to be faster and faster and faster; that real-time factor is really important. But I don't think time to first byte makes any sense, because you see these evaluations being run where they say, well, time to first byte. That basically just means the time from the end of speaking to the first response that is given. In a lot of these systems, they game it a little bit, and the first word that's given back is just a random word, and it's wrong. It's just the wrong word. Time to the first word, whether it's correct or not, is not a useful metric. You want time to the first correct word, and that is what we try to focus on as much as possible. We put accuracy as the above-all goal, but we've probably shaved about half the time off our latency so far, and we still have a little way to go. We are not the fastest model out there, but we are within range of the fastest ones, and we sit on that Pareto frontier of latency versus accuracy: we have the highest accuracy there, and the latencies are just a tiny bit behind, which for many of the use cases is really, really good. The bit that's driving all of this, though, is the function calling. If you're going out to a database, that slows things down, and that's what makes interactions longer. So that's why you want the speech-to-text to be as fast as possible.
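A minimal sketch of the pacing logic described here: a fast final transcript doesn't trigger an instant reply, it starts a clock, so the agent can answer at a natural gap without barging in. The 1.1-second floor and the event-based signaling are illustrative assumptions, not a described implementation.

```python
# Sketch: a turn-taking gate in the spirit of the pacing described above.
# Fast STT finals buy thinking time; the agent only speaks after a
# natural-feeling silence floor, and never if the user resumes talking.

import asyncio
import time

RESPONSE_FLOOR_S = 1.1   # assumed minimum gap before the agent may speak

async def gate_response(final_received_at: float,
                        reply_ready: asyncio.Event,
                        user_resumed: asyncio.Event) -> bool:
    """Return True if the agent should speak, False if the user kept talking."""
    elapsed = time.monotonic() - final_received_at
    try:
        # Wait out the remainder of the floor; abort on renewed user speech.
        await asyncio.wait_for(user_resumed.wait(),
                               timeout=max(0.0, RESPONSE_FLOOR_S - elapsed))
        return False          # user spoke again: never barge in
    except asyncio.TimeoutError:
        pass                  # silence held for the full floor
    await reply_ready.wait()  # LLM + TTS used the floor time to think
    return True
```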
Hermes Frangoudis (26:53) That makes sense. So time to first byte, while it is an interesting metric, is not really capturing the time to the first truly correct response.

Ricardo Herreros Symons (27:04) For us, we genuinely believe the key metric is the gap at the end. Speech models give out a partial result, which is sort of a "this could change," and then a final: "we're confident this is correct." The gap between the final and being able to have the turn end there is the key metric you want to be driving down. Now, I've heard some fun stuff around what you can do, and we play around with this as well. It depends how much money you've got to spend on tokens and LLMs, but you can technically have it so that as I'm speaking, you are constantly passing off to the LLM to say, get ready to respond, get ready to respond. You can do that off the uncertain final or off the partial. But if you're doing it off the partial, there's a good chance it changes. I might say, is this right? But then I might say, is this writing or handwriting? And so "right," R-I-G-H-T, would change to "writing," W-R-I-T-I-N-G, based on what I said afterwards. That partial "right" would become a final "writing," and if you'd been preparing a response based on "is this right," it would be wrong. You'd have sent a bunch of stuff off to the LLM and have to go, nope, this is wrong, throw it in the bin, and then send another call out to the LLM for a response. That eventually gets expensive. But even compared to last year, the cost of LLMs has dropped significantly, so actually you can do more things. I even know people who are looking at trying to guess how I'm going to finish the sentence. Your brain might have been putting "sentence" in there, or "the phrase," whatever. They're sending the requests off to the LLM, getting the LLM to guess the final word before I say the final word, so the latency is almost negative.

Hermes Frangoudis (28:50) Interesting. So using that partial, incomplete result, they're leveraging the LLM's ability to predict, like the fanciest autocomplete ever, hoping that what it predicted was right, to produce the correct response from it.

Ricardo Herreros Symons (29:07) Exactly. So I start talking, and you think you know what I'm going to say. The LLM says, okay, he's probably going to say "next"; we're going to put "next" in there before he says it, and here's the response we're going to have to that. Then if I don't say "next," you throw it away. If I do say "next," you keep it, and it's a faster response.
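A minimal sketch of that speculative prefill pattern: fire a draft LLM call on each partial hypothesis, keep it if the final transcript matches, cancel and retry if it doesn't (the "right" versus "writing" case). `start_llm` and the class shape are hypothetical stand-ins for an async LLM client, not a real API.

```python
# Sketch: speculative LLM prefill driven by STT partials, as described above.
# Drafts started on partials are kept only if the final transcript confirms
# the hypothesis; otherwise the tokens are sunk cost and we retry.

import asyncio

async def start_llm(prompt: str) -> str:
    await asyncio.sleep(0.5)              # stand-in for real model latency
    return f"response to: {prompt!r}"

class SpeculativeResponder:
    def __init__(self) -> None:
        self._task: asyncio.Task | None = None
        self._speculated: str = ""

    def on_partial(self, hypothesis: str) -> None:
        # A changed hypothesis invalidates the old draft: cancel, restart.
        if hypothesis != self._speculated:
            if self._task:
                self._task.cancel()       # tokens already spent are sunk cost
            self._speculated = hypothesis
            self._task = asyncio.ensure_future(start_llm(hypothesis))

    async def on_final(self, final_text: str) -> str:
        if self._task and final_text == self._speculated:
            return await self._task       # speculation paid off: near-zero gap
        if self._task:
            self._task.cancel()           # "is this right" -> "is this writing"
        return await start_llm(final_text)  # fall back to a fresh call
```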
Hermes Frangoudis (29:23) Do you think at some point the latency gets too fast and becomes unnatural? We all know unnaturally slow, right? I finish speaking, and two to three seconds later it responds; it feels like a walkie-talkie. But is there an uncanny valley when it comes to being too fast?

Ricardo Herreros Symons (29:39) Absolutely. And so the reason the speech-to-text needs to be fast is not to get the quickest response. It's to give the time to be able to give the right response at the right pace. Too slow is a pain in the neck, but too fast is awful. So you need fast speech-to-text purely to give the ability to respond if it needs to. But the answer should actually sometimes be, you know what, I'm going to pad out this response a bit longer and wait to respond, because there is nothing worse than the interruption. The ideal situation is to have the time in the bank to play with, to decide if you then want to respond. If you've got these things responding in under a second, it's awful. And you can also play around with that based on the demographic of the customer. I know we were working with one company that was selling high-quality wines via voice agents. People who buy high-quality wines tend to be a little bit richer and a little bit older, potentially speaking at a slower rate, more thoughtful, less fast out of the gate. And as a result, when they got a system responding to them in 1, 1.1 seconds, they hadn't had time to think. They want one and a half, they want two seconds. So you can get all this really fascinating user feedback based on the different demographics, and that could be age, that could be language, it could be culture. It all comes down to the fundamental need to have the time to decide if I want to respond. But even more important, above that, I need to make sure the response is accurate. So the speech-to-text being right in the first place is where it all begins, because a quick response is great, but a quick response to the wrong thing? Game over.

Hermes Frangoudis (31:16) No, that makes sense. Even with Agora, we see that making sure the LLM gets all the right data at the right time keeps the conversation going in the right direction; otherwise it goes wildly off the rails. The Speechmatics team works across a bunch of different industries, healthcare, finance, media, and you've been there from the beginning. Which of these industries has really surprised you the most?

Ricardo Herreros Symons (31:38) We started off in media, because it was easy in terms of available data, high-quality recording environments, that sort of thing. We actually stayed away from healthcare for quite a while, because Nuance was a player in the space, and they famously liked to litigate against anyone who started showing they had a good engine out there. But I would say it is now healthcare which has surprised me the most, because traditionally you think of healthcare and you think slow-moving, lots of regulation, not necessarily a quick buck to be made. But there's this requirement now: as I mentioned earlier, these clinicians, these physicians, these healthcare professionals are so time-poor, and the cost to serve with these AI models has dropped. Now you've got this fantastic world in which B2C applications are so good, and they are saving so much time and efficiency, that they are forcing the big regulatory bodies, the big behemoths, the big providers of the technology to think and act like startups, at the risk of otherwise losing massive market share. So the speed of adoption in healthcare has genuinely surprised me, and it's great.
Hermes Frangoudis (32:44) Healthcare is one of those industries, like you said, that you think of as being traditionally slow to adopt this technology. What are the biggest frustrations you see speech AI removing from the healthcare industry?

Ricardo Herreros Symons (32:59) So again, it's the less interesting work. The ambient scribing use case: you have a consultation, you have all of your notes put into your electronic medical record (EHR) system immediately, actions followed up, and you can just review it. That is the absolute clearest frustration. No healthcare professional likes or enjoys doing that. It's not high-value work. Obviously you want them to be able to check things, and I think with everything with AI it's: have the AI do the work, and then make sure the work is good. So that is really important. But that is just the clearest use of the technology, where it saves time, saves lives.

Hermes Frangoudis (33:37) So healthcare is one of these regulated industries, and finance is similar. What worries customers when it comes to voice data and speech technology within these industries?

Ricardo Herreros Symons (33:47) Lots of things worry people across the board, and understandably so. I'd say where speech technology benefits is that, again, because it is the gateway to LLMs, generally the nervousness around LLMs is bigger than the nervousness around speech. In the UK, for example, there's this whole thing where if you're an ambient scribe and an LLM is summarizing, then the summary is technically a medical decision, because you're deciding what to put in and what not to. That decision therefore needs to be certified very specifically; there are Class IIa and Class IIb medical devices if you're doing something like that. Speech itself is just a transcription. It can piggyback off the fact that a lot of legwork has already been done on the regulation side in getting the LLMs certified, so the speech is actually a slightly less challenging problem to slot in there. Obviously there are questions about keeping people's voice data on-premise, HIPAA compliance, things like that, but there's a route through it. In general, it's been far more plain sailing than we probably expected.

Hermes Frangoudis (34:49) Have you seen teams change how they work once you introduce something like speech technology? Does it make the interaction searchable? How does that impact the dynamic of the team?

Ricardo Herreros Symons (35:04) I guess it depends which use case we're looking at, but yeah, this technology can shave hours off people's days, depending on what you're doing. We work with video editing platforms where they upload an audio file and then they can search through via the audio; they've got the closed captions already there, it's just done. On the healthcare side, we talked about ambient scribing, but the other side of it is obviously all the booking: appointment booking, consultations, monitoring people. You can have speech engines monitoring hospital wings, searching for anything that's being said. I can't even begin to talk through how many ways this can transform people's lives, but yeah, it's very real.
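A minimal sketch of the searchable-audio idea: once the engine returns per-word timestamps, a plain text match becomes a seek position in the underlying video or audio. The (word, start_time) shape is an assumption about typical word-level STT output, not any particular platform's format.

```python
# Sketch: searching audio via its transcript. With per-word timestamps,
# a text match doubles as a media seek position. The (word, seconds)
# tuples are an illustrative assumption about word-level STT output.

def find_mentions(words: list[tuple[str, float]], query: str) -> list[float]:
    """Return start times (seconds) of every occurrence of `query`."""
    tokens = query.lower().split()
    flat = [(w.lower().strip(".,?!"), t) for w, t in words]
    hits = []
    for i in range(len(flat) - len(tokens) + 1):
        if [w for w, _ in flat[i:i + len(tokens)]] == tokens:
            hits.append(flat[i][1])
    return hits

words = [("The", 12.0), ("patient", 12.2), ("reported", 12.5),
         ("chest", 13.0), ("pain", 13.3), ("overnight.", 13.6)]
# Jump the video player straight to 13.0 s for review.
print(find_mentions(words, "chest pain"))   # -> [13.0]
```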
Hermes Frangoudis (35:49) Because of how many ways it can impact and improve the business, what would you say, from dealing with customers and speaking with businesses, really convinces them to move from doing a pilot to a full deployment?

Ricardo Herreros Symons (36:02) So I guess traditionally when you're selling software, you're either trying to reduce costs or increase revenues. I think the argument for voice AI in something like support is just so black and white; the ROI is so big there. You can have fewer agents, you can reduce your hold time, you can have the training, have better interactions. Again, coming back to our earlier discussion of getting insurance in US healthcare, there are so many levels on which you can reduce costs there that it makes sense. And we're seeing outbound BDR teams being replaced by agents here too. It's one of those areas that is not only efficiency but genuine, real value-add as well. In some ways, I'd actually say the world is shifting a little bit too. It's pushing more towards ongoing POCs. Traditionally, you start with the customer and they sign up for a few hundred thousand dollars, great. Now what we're seeing is people want to get an engine, they want to sign up for a smaller commit, 50K, whatever, and build, but they're still sort of in POC phase. It's not until the second year, the full year of working together, that they're finally out of POC. In the meantime, they'll have given you a bunch of revenue, they'll have used you lots, they'll have had all the hours there, but people really want to embed and continue to evolve the technology within their workflows. So there's an expectation people will test and implement sooner, but getting to what would be a traditional commit model takes longer; in the meantime, they're driving those baseline metrics anyway. It's a different way of doing it, but I guess it puts more of an onus on us to continue developing and to really build those deep relationships.

Hermes Frangoudis (37:48) So the pilots just go from small pilot to bigger pilot to bigger pilot and naturally grow into this.

Ricardo Herreros Symons (37:54) Yeah, and at some point that pilot's delivering a couple of million dollars and they're like, this is probably not a pilot anymore.

Hermes Frangoudis (37:58) Yeah, this is a full deployment. Let's take a little bit of a shift here. How do you think people's relationship with voice technology will change over the next few years?

Ricardo Herreros Symons (38:11) So I was saying at the beginning that I used to always talk to Google, and I used to get mocked for it. Now it's the other way around. So many of my friends, colleagues, peers, they talk to their LLMs, they dictate their voice messages. We're just going to be using it more and more and more, in my view. Obviously there's still the case where you're in a noisy environment. And culturally as well, traditionally voice messaging is much more common in countries with character-based languages, because it is a bit of a pain to write; it takes a little bit longer, especially on a phone. But me personally, I hate voice notes. I despise voice notes.
If you send me a voice note, I'm not going to listen to it, just because I'm always around people and there's always something going on, and getting a moment to listen to voice notes is a real pain. If somebody dictates the message, fine, I can read that; I've got it in an accessible format. If I could speech-to-text all my voice notes, that'd be great. But yeah, I've already seen the shift in how people interact with this, and I think it's only going to go one way.

Hermes Frangoudis (39:16) It's super interesting, because it felt like the early technology was kind of pushing up against it; you were forcing it to try to do things. Now, because of LLMs, it's a lot more accessible, and like you said, it's growing. Do you think the voice interface will become just an expectation, or will it still be a very prominent feature?

Ricardo Herreros Symons (39:38) I think if you don't have a voice interface, then you're just going to be moving more slowly, realistically. You should still have different options, but it's just going to be inefficient not to have one. And sorry, just onto your point there as well: the LLMs have made it a lot more accessible, but the other thing that's made it more accessible is, I mean, in SaaS you don't want to see prices going down, but the fact that speech-to-text has reached a level of cost where you will just put it on in the background for every meeting, for every interaction, means the volume that is now transcribed has gone absolutely through the roof. So that has been another level of accessibility increase, or affordability, really.

Hermes Frangoudis (40:20) Yeah, just making it accessible to those teams in a way that's not cost prohibitive. That makes sense. So it's always going to be there, and now it'll be there even more, because it'll be easier to build in without draining your resources. What excites you most about where speech AI is headed right now?

Ricardo Herreros Symons (40:38) It's the fact that it is heading everywhere.

Hermes Frangoudis (40:41) The ubiquity. Yeah.

Ricardo Herreros Symons (40:43) The most exciting thing is, yeah, I guess people talk about different qualities, right? I think there's a quote from Napoleon: he used to brag that when he was going to war, France's recruitment of soldiers was so great that they simply had more soldiers than everyone else they were fighting in the 1800s. There's a point where quantity is a quality in and of itself. People always say quality over quantity, but if you have enough, quantity itself is a quality. And I see that for speech: the volume of it is the exciting part, the ubiquity, because there is so much to go after. It just means it will transform everything. It's going to create more data, it's going to be a virtuous loop, and it's just going to be great.

Hermes Frangoudis (41:32) What's a voice AI capability that feels close but not quite there yet? Especially for someone like you that's really embedded in this.

Ricardo Herreros Symons (41:40) Speech-to-speech conversations feel great. They are moving closer and closer, but for regulated, enterprise-grade use, they're not there yet. You still want a cascaded approach, in my view. Having that central pillar of text to be able to control things, to guardrail, is really important.
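A rough sketch of the cascaded shape being argued for, with text as the control pillar between STT and TTS. Every function here is an illustrative stub, not a real service; the point is that the intermediate text gives a checkpoint where guardrails can be logged, audited, and tested, which an opaque audio-in, audio-out model doesn't offer.

```python
# Sketch: one cascaded voice-agent turn with text as the control pillar.
# All three stage functions are illustrative stubs, not real services.

def transcribe(audio: bytes) -> str:
    return audio.decode()               # stub STT: pretend audio is text

def generate_reply(text: str) -> str:
    return f"You said: {text}"          # stub LLM

def synthesize(text: str) -> bytes:
    return text.encode()                # stub TTS

BLOCKED = ("account password", "wire transfer")

def guardrail(text: str) -> str:
    # Policy runs on text, so it can be logged, audited, and unit-tested.
    if any(topic in text.lower() for topic in BLOCKED):
        return "I can't help with that over voice; let me get you a person."
    return text

def cascaded_turn(audio_in: bytes) -> bytes:
    user_text = transcribe(audio_in)            # checkpoint 1: what was asked
    reply = generate_reply(guardrail(user_text))
    return synthesize(guardrail(reply))         # checkpoint 2: what we'll say

print(cascaded_turn(b"what's the delivery status?"))
# A speech-to-speech model collapses these stages into one opaque step,
# with no text checkpoint to verify before the reply is spoken.
```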
And I'd also say, on the emotion side as well, it is getting there, but I still haven't had an interaction with a system that really captures an understanding of how I'm saying something. I mean, we're known for dry sarcasm in England, and I haven't seen something that captures that effectively.

Hermes Frangoudis (42:18) Makes sense. Not only the intent, but the inflection of it all, and it's not quite there. When you say speech-to-speech, that also gives me some flashbacks, because playing with speech-to-speech models, you realize very quickly that the audio in and out doesn't always match the text out, right? And in highly regulated industries, you just can't have that level of deviation. Even if it's one word, it brings everything into question. So yeah, I'm going to stand firm with you on the cascading. I believe we're going to continue to see speech in, speech out, but it's how that happens in the middle.

Ricardo Herreros Symons (42:56) I think speech-to-speech will come. I don't think it's far away, but I don't think it's going to make a big impact in '26. The way I see it being used is quite interesting. Speech-to-speech gives you a really natural, nice experience; the cascaded models are moving that way too. Speech-to-speech gives you great low latency, but I see a world in which people are honestly switching out between models. Maybe the beginning of the interaction is a speech-to-speech model; it works out what you're trying to do, and then it moves out to a cascaded model. We've seen that with a few of our partners. But for the real enterprise-grade, great stuff, speech-to-speech is not there yet.

Hermes Frangoudis (43:32) So we're coming towards the end of our interview, and I really appreciate all your time for this. One of the questions we leave towards the end is a little bit of a wild card. Usually I'll ask people, if you weren't doing voice AI, what part of AI would you be working with? But given your background and how deep you are in this, I feel like you'd always be in this area. So what do you think is one thing about human speech that AI researchers still don't fundamentally understand?

Ricardo Herreros Symons (44:00) So I think on the human speech side, the emotion side of things is definitely a big part of it. But actually, I'll flip the question slightly. I don't think it's what the researchers don't understand; I think there's going to be a weird bit where speech models will change, but we're sort of going to meet in the middle. I think humans are also changing how they interact to work with the AI. The example I give: when my fiancée is sitting there chatting to Claude or GPT, she doesn't speak like she does to me. She's far more polite. And the way people speak in general, again, coming back to that end-of-turn, end-of-utterance stuff: if you're making a request, you say, hey, I'm in San Francisco and I'm wondering where I can find tickets for the ball game later, could you also check the... and you'll see people go into this mode of not giving the AI a chance to respond, because you don't quite know what you're asking, so you're elongating the way in which you speak for the AI. I think there's a ton of things that AI doesn't understand about humans.
There's also the fact that humans are naturally evolving how they interact to make the AI understand them, which is a really funky paradox. I don't know how I feel about it.

Hermes Frangoudis (45:22) It's interesting. Not only are we working with what we've traditionally known as the way people interact with each other; now the way they interact with the LLM is adjusted to, I don't want to say fit the interaction model, but it's adjusted to that interaction model in a way that is not as natural. And so it changes us, right? Super cool. Well, thank you so much for your time, Ricardo. And thank you, everyone, for...

Ricardo Herreros Symons (45:43) It's changing us to fit in with the technology.

Hermes Frangoudis (45:50) ...listening. We're going to be on all the social media things, so please like, subscribe, and follow for more, and we'll see you on the next one.