Hermes Frangoudis (00:07) Hey everyone, welcome to the Convo AI World podcast. In this podcast, we interview builders, developers, and teams that are building solutions for the conversational voice AI space. Today I'm joined by Ziyi Lin, a speech engineer from the TEN team. Welcome. Thanks for joining me. Ziyi Lin (00:25) Yeah, thanks for having me. I'm glad to be here. Hermes Frangoudis (00:29) Everyone's familiar with TEN. We met with Ben Weekes in our first episode, but for those that are not familiar, TEN stands for the Transformative Extension Network. It is a framework for building multimodal, voice- and video-based AI agents. And today we have one of the core engineers from the team, who is focused on speech. So I'm really excited for this one, let's dive right in. So Ziyi, what problem was really the catalyst? What was the reason that the team decided to create this open source VAD rather than continuing to use an off-the-shelf solution? Ziyi Lin (01:11) Well, this question is quite good. We all know this is the AI agent era, or let's say the conversational AI era. With the rise of AI agents, VAD, Voice Activity Detection, has become even more crucial. It's become more and more important nowadays. We need systems that can quickly spot the start of sentence, in short SOS, and the end of sentence, EOS, with lower computational load and smaller library size. Also, the VAD system has to handle false negatives and false positives well. In other words, the VAD system should have high performance and should really be agent-friendly. And in addition, it should be able to detect the short non-speech segments between two separate sentences in order to lower the end-to-end latency. Existing VAD solutions just aren't cutting it in some aspects. I will take the old-school VAD systems as an example: the traditional energy-based VAD, or the pitch-based or GMM-based VAD from WebRTC. They are not robust against noise, resulting in a high false alarm rate, so we cannot use them in a voice AI agent. And the SOTA deep neural network-based VAD, Silero VAD, has high latency; the latency may not be tolerable. So this is why we decided to develop our own open source VAD system. Hermes Frangoudis (02:54) Makes sense. It sounds like there were a lot of really strict requirements. You said, you know, start of sentence, end of sentence. Those sorts of things are so critical. I've come to notice, the more I play with AI agents, it becomes apparent that sometimes if you don't have a really good VAD for it to detect Ziyi Lin (02:59) Yeah. Hermes Frangoudis (03:18) that end of sentence, it kills it. You're sitting there going, did you hear me? It's like the old Verizon commercials, right? Ziyi Lin (03:23) I can hear you. Yeah, for example, if the end of sentence cannot be detected quickly, then the users like you, using the AI agent, are always waiting for the agent to give you an answer, give you a response. But after one second or two seconds, you didn't hear any response from the agent. You would be annoyed and say something again. Then the experience is not good. Hermes Frangoudis (03:52) Yeah, I totally do that. Huh? Yeah, the experience is not as good, you're right. All right, the off-the-shelf stuff was not cutting it. And the idea is, all right, we're going to make our own, right? So walk me through this. Was there a pitch? Or a prototype? What kind of convinced the team to be like, yeah, we're going to go this way.
We're going to do it ourselves. Ziyi Lin (04:16) This is very easy, very simple. Just because our VAD has better performance compared to the SOTA one, the SOTA deep neural network-based VAD, Silero VAD. Yeah, we have better performance: we have lower false positive rates, we have a lower miss detection rate. And also we have a smaller library size. Our library is just a few hundred kilobytes in size. On Linux, our library is just about 300 kilobytes, but for Silero VAD, it is about 1,000 or 2,000 kilobytes. And in addition to that, we also have lower latency. We have tested the latency of Silero VAD, and it is clearly higher than ours; we only have dozens of milliseconds of latency, around a hundred milliseconds at most. So we can give a better user-agent interaction experience than Silero VAD. In summary, we have higher performance, we have lower latency, and we have a smaller library size. Hermes Frangoudis (05:26) No, that totally makes sense. And I think it's that super compactness that really makes it appealing, right? It's really lightweight. You're talking about, at smallest, a third, right, if you're comparing 1,000 kilobytes versus 300 kilobytes. And then if it's 2,000 kilobytes, you're talking about a sixth of the size. That's dramatically smaller. And you're saying it responds faster, it can detect better than Silero VAD, which is kind of the standard, right? That's what most people are using now. So looking back at the setup, when you were planning the development of this, what early design choices did you make that really made the big difference in hitting the low-latency targets while still keeping the model size small? Ziyi Lin (06:19) First of all, let's talk about data, the training data. We actually use precisely, manually labeled data for training rather than the open source data, because the open source data provides labels, but the labels are low precision. We cannot use that data for our model training because the precision is so low; the labels are not correct. I mean, the cheapest is the most expensive. If we just used the open source data for training, then our VAD model would learn that latency from the low-quality labels, and the experience would be worse due to high latency. So from here, we can see that VAD plays an extremely important role in human-agent interaction. And secondly, we use a physical feature. We call it pitch. Pitch is a very important characteristic of our vocal cords. We use the pitch, or let's say the fundamental frequency, as an input feature. This is the second point. Hermes Frangoudis (07:25) So the first real differentiator was the data. It all comes down to that core training data, right? The better the training data, the better the result. Ziyi Lin (07:38) Yeah, because this is a data-driven approach. I mean, it's the deep neural network, deep learning era, right? Hermes Frangoudis (07:45) So you had high-precision, really well-labeled data. That was something that you could train on. And you're saying that by adding in the pitch, you're able to differentiate better as to what is the start of a sentence and the end of a sentence. How does the pitch play into this? Can you maybe unwrap that a little bit? Ziyi Lin (08:05) Yeah, because pitch is an important characteristic of our vocal cords. Only when our vocal cords vibrate do they generate pitch, or fundamental frequency. Noise or... Hermes Frangoudis (08:19) No, no, I totally understand what you're talking about, the pitch itself. But when you're talking about it in the sense of activity detection, right? How does the pitch data play into that detection? Can you kind of walk us through it? Ziyi Lin (08:39) Because only human voice, human speech, contains pitch, but noise does not. And the model can learn these characteristics. If the model sees a pitch of, say, 100 hertz, then the model will think maybe someone is speaking. If the pitch is zero, then there should be no one speaking; it is probably noise or some other sound.
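As an aside for readers, here is a rough, self-contained illustration of the pitch cue Ziyi is describing: voiced speech carries a fundamental frequency from the vocal cords, while most noise does not. This naive autocorrelation check is only a sketch to make the idea concrete; it is not how TEN VAD works internally, and the frame length, frequency range, and voicing threshold are illustrative assumptions.

```python
"""Toy illustration of the pitch cue: voiced speech is periodic, most noise is not.
Not TEN VAD's actual method; all numbers below are illustrative assumptions."""
import numpy as np


def frame_has_pitch(frame: np.ndarray, sr: int = 16000,
                    fmin: float = 60.0, fmax: float = 400.0,
                    voicing_threshold: float = 0.45) -> bool:
    frame = frame - frame.mean()
    energy = float(np.dot(frame, frame))
    if energy < 1e-8:
        return False                               # silence: no pitch
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                       # shortest period we accept
    lag_max = min(int(sr / fmin), len(ac) - 1)     # longest period we accept
    peak = ac[lag_min:lag_max].max()
    return (peak / ac[0]) > voicing_threshold      # strong periodicity implies voiced


# Quick check with synthetic signals: a 120 Hz "voiced" tone vs. white noise.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr                 # 30 ms frame
voiced = np.sin(2 * np.pi * 120 * t)
noise = np.random.default_rng(0).normal(size=t.size)
print(frame_has_pitch(voiced, sr), frame_has_pitch(noise, sr))  # True False
```

A real VAD would feed features like this, among others, into a learned model rather than thresholding them directly.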
Hermes Frangoudis (09:07) So that's like a way of detecting the background versus what the person is actually saying, from a model's perspective. Right, because background noise, something that is not a voice, will have no pitch at all. And then if there is pitch, it's a voice. So that's one of those differentiating characteristics in the sound space. Okay. Ziyi Lin (09:19) Yeah. Yup. So pitch is very important in the whole model. Hermes Frangoudis (09:33) That's very interesting. I had no idea. And pitch is perfect to lead into the next question: what kind of algorithms or signal processing tricks allow this VAD to be so robust across all the different, diverse inputs, right, like microphones, environments? Is it just the pitch and the training data, or is there something even cooler about this that Ziyi Lin (09:36) Haha. Hermes Frangoudis (10:01) we can dive into? Ziyi Lin (10:02) First of all, the training data. We collect the training data in many scenarios, in many rooms and room configurations, also with many different kinds of microphones, many different distances between the speaker and the microphone, and different reverberation. Also, the pitch feature you mentioned. And last, we use multitask learning to train the model. So our VAD model not only does the voice activity detection task, but also does other tasks, and these tasks help the VAD task to learn better and help the model to better distinguish speech and noise. Hermes Frangoudis (10:48) So it's almost like the algorithm is self-learning as it's going. Even as it's working, it's improving its understanding of voice versus noise. And it's interesting, you say you trained from different inputs, different environments. It feels like the scale of the training data is probably bigger than the scale of training data most other companies have access to. So when we talk about VAD, VAD doesn't really operate in a vacuum. It's the critical first piece in TEN's cascading model, right? The VAD is what triggers the downstream LLM and the TTS modules. How do you manage that state in real time for the VAD to be Ziyi Lin (11:25) Yeah. Hermes Frangoudis (11:37) so precise like that? Ziyi Lin (11:38) Yeah, VAD essentially acts as the traffic controller here by detecting both the start of sentence, SOS, and end of sentence, EOS, as I mentioned before. It is what triggers the entire chain in real time. So for example, once VAD picks up the end of sentence, like 200 milliseconds of silence (this threshold we can adjust based on our needs), it signals the STT system, the speech-to-text system, to finalize the transcription. And this transcription is then sent to the LLM, and the LLM output triggers the TTS to synthesize a response. So this is why low VAD latency is very important in a conversational AI application.
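To make the traffic controller idea concrete, here is a minimal sketch of a VAD-gated turn loop: speech frames stream to the STT, and a configurable silence threshold (200 ms here, matching Ziyi's example) triggers finalize, then LLM, then TTS. This is not the TEN framework's actual API; the stt, llm, and tts objects below are placeholder stubs, and the frame size and thresholds are assumptions.

```python
"""Minimal sketch of a VAD-gated conversational turn loop (illustrative only;
not the TEN framework's actual API; stt/llm/tts below are placeholder stubs)."""

FRAME_MS = 16            # assumed audio hop size per frame
EOS_SILENCE_MS = 200     # adjustable end-of-sentence silence threshold
SPEECH_THRESHOLD = 0.5   # VAD probability above which a frame counts as speech


class StubService:
    """Stands in for real STT / LLM / TTS services."""
    def push(self, frame): pass                       # stream a speech frame to STT
    def finalize(self): return "(final transcript)"   # close the current utterance
    def complete(self, text): return f"reply to: {text}"
    def speak(self, text): print("TTS says:", text)


def run_turn_loop(frames_with_probs, stt, llm, tts):
    """frames_with_probs: iterable of (audio_frame, vad_speech_probability)."""
    in_speech, silence_ms = False, 0
    for frame, prob in frames_with_probs:
        if prob >= SPEECH_THRESHOLD:
            in_speech, silence_ms = True, 0   # start of sentence (SOS) or continuing speech
            stt.push(frame)                   # only speech frames reach the STT
        elif in_speech:
            silence_ms += FRAME_MS
            if silence_ms >= EOS_SILENCE_MS:  # end of sentence (EOS) detected
                transcript = stt.finalize()
                reply = llm.complete(transcript)
                tts.speak(reply)
                in_speech, silence_ms = False, 0


# Toy run: ten "speech" frames followed by enough silence to trigger EOS.
frames = [(b"frame", 0.9)] * 10 + [(b"frame", 0.0)] * 20
run_turn_loop(frames, StubService(), StubService(), StubService())
```

In a real agent, the frames and probabilities would come from the microphone and the VAD model, and a new start of sentence detected while the agent is speaking would trigger interruption handling.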
Hermes Frangoudis (12:28) Yeah, you said it best. I've never heard it said that way, but I really like how you put it. It's the traffic controller. It's like, "let's go, let's go. You're coming in, you're coming in. We're done. We're done". That's it. Time to process, right? End of sentence and end of speech, let's get it moving. But another critical point of that traffic controller is also to be like, "Oh wait, stop this whole process. You know, we have, Ziyi Lin (12:36) Traffic controller. Yep. Hermes Frangoudis (12:55) we have an interruption". So VAD has to catch, you know, the start of sentence too, right? Ziyi Lin (13:01) Yeah, the start of sentence affects the interaction latency, I think. And the EOS affects the whole end-to-end, or response, latency. So it's very important. Hermes Frangoudis (13:18) Yeah, it's such a critical piece. And then, when you see TEN being run, and this VAD, is it more edge or full cloud? How important is where you run it, depending on the use case? Ziyi Lin (13:35) The VAD model is essentially, basically, very small, only a few hundred kilobytes. Typically, it can be run on edge devices to ensure low-latency detection of voice activity. And users hate delays when interacting with agents. Running VAD on the edge just cuts down the lag because it is right there on the device, detecting speech starts and stops instantly. Another thing is the bandwidth and cost savings. If the VAD is on the edge, it only sends audio frames that actually have speech to the STT, the speech-to-text or ASR system, whether that's on the edge too or in the cloud. That helps users save the bandwidth costs and the STT costs and keep things efficient. Hermes Frangoudis (14:38) Yeah, that's huge. The STT cost is something you don't want to mess around with, right? Because you don't... Ziyi Lin (14:43) Because you don't want to just send noise, send the non-speech segments, to the STT system, because it is not meaningful. Hermes Frangoudis (14:54) Exactly. Yeah, nothing would come out of it. You're just burning tokens. Earlier you spoke about low latency, false positives, that sort of thing. What kind of benchmarks are we running here? How do we compare these with the alternatives? Because I know we said it's faster, it's better. How do we know that? Ziyi Lin (15:17) To be honest, there are many open source datasets, but their VAD labels are not precise enough. If we just use these datasets for evaluation, it may lead to wrong performance results. To address this, we built a very large dataset combining our real-world recorded data and a filtered version of the open source data. What does the filtered version mean? We manually relabeled the data. We didn't use the original label files of the open source data because they are not precise enough. So by using our precisely, manually labeled data, the evaluation is more reliable. On our large test sets, we have a lower false alarm rate, a lower miss rate, and a larger area under the curve, the area under the precision-recall curve, which is a very common metric for classification problems. Compared to the SOTA open source VAD, Silero VAD, we have better performance and a smaller library size.
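For anyone who wants to run this kind of frame-level comparison on their own recordings, a generic evaluation sketch might look like the following. It is not the TEN team's benchmark harness; it assumes per-frame ground-truth labels and per-frame speech probabilities aligned at the same hop size, and uses scikit-learn only as a convenient way to compute the precision-recall curve.

```python
"""Generic frame-level VAD evaluation sketch (not the TEN benchmark code).
Assumes per-frame ground truth (1 = speech, 0 = non-speech) and per-frame
speech probabilities from the VAD, aligned at the same hop size."""
import numpy as np
from sklearn.metrics import auc, precision_recall_curve


def evaluate_vad(labels: np.ndarray, probs: np.ndarray, threshold: float = 0.5) -> dict:
    preds = (probs >= threshold).astype(int)
    speech, nonspeech = labels == 1, labels == 0
    miss_rate = float(np.mean(preds[speech] == 0))            # speech frames called non-speech
    false_alarm_rate = float(np.mean(preds[nonspeech] == 1))  # non-speech frames called speech
    precision, recall, _ = precision_recall_curve(labels, probs)
    return {"miss_rate": miss_rate,
            "false_alarm_rate": false_alarm_rate,
            "pr_auc": auc(recall, precision)}                 # area under the PR curve


# Toy example with synthetic placeholder data (replace with real labels/probs):
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
probs = np.clip(0.7 * labels + rng.normal(0.15, 0.2, size=2000), 0.0, 1.0)
print(evaluate_vad(labels, probs))
```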
Hermes Frangoudis (16:23) So when running the open source alternatives on our datasets, we could see the performance difference. And then even when labeling the public datasets, we still outperformed them. Ziyi Lin (16:33) You can just go to our GitHub site and see the PR curve of our TEN VAD, and also Silero VAD, and also the pitch-based VAD from WebRTC. You can test it by yourself, and also with your own data. But the point is the labels. You should label manually and precisely. Hermes Frangoudis (16:52) That's the big key, that's the secret sauce, I guess, right? Yeah. And because it's open source, we can kind of share the recipe. The recipe is: make it from scratch. Ziyi Lin (16:56) Yeah. As I mentioned before, the cheapest one is the most expensive one. You should put some effort into it. Hermes Frangoudis (17:09) But it's also the most performing, right? Yeah. I mean, I'm going to take your word for it on this one. When it comes to this accuracy, how do you maintain that accuracy when users are switching between, like, speech, music, maybe background noise? The person is maybe moving, or there are things moving around them that are causing the environment to change. Ziyi Lin (17:17) I mean, these scenarios are all in our training set. To be honest, our training set contains data that switches between speech, music, and background noise during a session. Thus, the model can learn to switch between different scenarios and give the correct voice activity detection. And our training data contains many kinds of scenarios, not only these, so our model can adapt to different use cases, different scenarios. Hermes Frangoudis (18:05) That's huge. It all comes right back to that core dataset. You're just not going to escape good data, robust data, well-labeled data. If people aren't getting the hint by now: data. Ziyi Lin (18:22) As I mentioned before, it's a data-driven approach. Hermes Frangoudis (18:24) Yeah, yeah, it's all about data and... Ziyi Lin (18:28) Yeah, and also the pitch feature, our model design, and our loss function design are also critical. Hermes Frangoudis (18:34) They all have roles in supporting this. So this thing's light, you said 300 kilobytes. So when you're talking about CPU performance, because the CPU is limited, the smaller the better, right? It's not a GPU, it's not doing parallel processing, you do a lot more serial work, maybe you get some threading in there. Ziyi Lin (18:35) Yeah, yeah. Hermes Frangoudis (18:55) So when you're talking about running this on a mobile device, is there going to be a trade-off there between CPU and detection precision, or is this something that's still going to perform? Ziyi Lin (19:06) Good question. But honestly, we didn't have to make a huge trade-off between CPU usage and precision on mobile, and I will explain why. VAD is basically a simple binary task, right? It is just the classification of two classes, speech and non-speech. It just needs to tell if there is speech or not. So that lets us keep the model tiny, only a few hundred thousand parameters. Even on mobile, it barely uses any CPU at all. We can always run our VAD on mobile, and it doesn't affect the CPU performance a lot, just a bit. You can totally ignore it. And basically the task's simplicity lets us avoid that tough choice between efficiency and accuracy. It runs light on mobile, but still catches speech reliably. So in summary, don't worry about the CPU usage, since the model itself is simple enough.
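A quick back-of-the-envelope estimate supports the "you can totally ignore it" point. Assuming roughly one multiply-accumulate per parameter per frame and an illustrative 16 ms hop (the actual hop size and parameter count may differ), a few-hundred-thousand-parameter model costs only tens of millions of operations per second, a tiny fraction of what a single mobile CPU core can do:

```python
"""Back-of-the-envelope CPU cost for a tiny frame-level VAD (illustrative numbers only)."""
PARAMS = 300_000                 # "a few hundred thousand parameters"
HOP_MS = 16                      # assumed frame hop; the real hop may differ
INFERENCES_PER_SEC = 1000 / HOP_MS               # about 62.5 frames per second
MACS_PER_SEC = PARAMS * INFERENCES_PER_SEC       # about 1.9e7 multiply-accumulates per second

MOBILE_CORE_OPS = 10e9           # conservative ops/sec estimate for one modern phone core
print(f"~{MACS_PER_SEC / 1e6:.0f} M MACs/s, "
      f"about {100 * MACS_PER_SEC / MOBILE_CORE_OPS:.2f}% of one core")
# prints: ~19 M MACs/s, about 0.19% of one core
```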
Hermes Frangoudis (20:15) That's awesome. I like how you put it. It's a yes or no question, right? Yes, keep going. No, ignore it. And we all know mobile, that's like the ultimate edge. That's the closest you're going to get to the user, being able to run something right in the user's hand. So that's awesome that it can be deployed on mobile and have virtually no impact on Ziyi Lin (20:20) That's good. Yeah. Hermes Frangoudis (20:41) CPU. Do you guys have feature requests coming in from the community, or not really? Ziyi Lin (20:46) The feature requests... I think there are no feature requests from the community right now. Yeah, it's pretty new. Maybe after several months. Hermes Frangoudis (20:52) No, yeah, it's pretty new. Okay. We'll check back in. So we're going to skip down to the frontier thinking. TEN VAD has kind of solved this challenge; it does it a lot better than the market alternatives. What are the unsolved technical hurdles here? Can we solve overlapping speakers, or maybe whispered speech, or multilingual? Is that on the horizon for VAD, or does that go somewhere else in the whole ASR pipeline? Ziyi Lin (21:28) I think VAD is only responsible for detecting whether there is speech in the audio frame or not. And the VAD itself is multilingual. We can use the VAD with any language, because we are all the same: our vocal cords are the same. Hermes Frangoudis (21:44) It's very distinct, right? It's the human vocal cord sound versus any other animal or any other sound. Yeah. Ziyi Lin (21:51) Yeah, yep. And also, overlapping speakers doesn't seem like a task for VAD. VAD can just detect whether there is speech, but the VAD will not know if there are two speakers or three speakers or many speakers. It doesn't... Hermes Frangoudis (22:17) It doesn't really care. It just says, this is speech, pass it to the next step, right? Ziyi Lin (22:19) Hmm. And whispered or weak speech detection is a big challenge. For example, when someone speaks softly, like whispering, the audio signal is extremely low in energy. This makes it easy for VAD to mistake it for non-speech and miss it entirely. But with the development of deep learning, the model itself can learn the characteristics of whispered speech. The model does not just look at the volume. Ziyi Lin (22:58) Although the speech is at a low volume, the frequency structure, the spectral structure, still exists. So the model can extract these features regardless of the volume. Hermes Frangoudis (23:20) So it can do that now? The current one does this? That's super cool. Ziyi Lin (23:22) Yep. You can try our model, right? Hermes Frangoudis (23:29) Well, no, I've tried it. I think this is more just so the people listening along can hear this spoken out, right? Seeing is believing, and we're going to include some links to the Hugging Face page, to the benchmarks. We'll make sure that this is all included when we air this live, or air the recording of this, so that everyone has access. But for right now, I'm going to pick your brain a little bit more on some fun stuff.
So when we're talking multimodal agents and we're adding video, how might the VAD collaborate with visual silence detection to improve that turn-taking? Will the VAD play a role there, or again, will that be another piece of the puzzle? Ziyi Lin (24:14) I think VAD and visual silence detection can help each other, but they need to work together. VAD checks the sound, when someone starts or stops speaking. And the visual part checks what we see, like whether their mouth is moving; when you are speaking, your mouth is moving. Together they are better. For example, if VAD hears silence but the camera sees that their mouth is still moving, we know they are not done, and that stops the agent from interrupting too early. But there are also problems. Sometimes someone stops speaking to think, so VAD will hear silence, but their mouth might stay open, and then the visual part cannot tell if they will keep going. Or in a noisy place, VAD might think a noise is speech, a misdetection, but the visual sees their mouth is closed, so that fixes the mistake. Hermes Frangoudis (25:29) They work together to correct Ziyi Lin (25:31) So, yeah. The outliers. If one makes a mistake, then the other part will detect it better. Hermes Frangoudis (25:31) The outliers, yeah. Ziyi Lin (25:41) We need both: VAD for the sound and visual for the looks. They can help each other and get it right more often. It is not perfect, but better than using just one. Hermes Frangoudis (25:55) No, it totally makes sense. Humans aren't perfect at understanding when people are done speaking either. I've totally been there where you're on a video chat or a video call and maybe there's a little lag. The person stops to think, plus the connection is a little laggy, and you start to talk, and then their audio starts to come in. You're like, "Oh, no, that was not my turn", right? And that's just human to human. So with AI, it's super cool that there are these efforts to kind of Ziyi Lin (26:16) Hahaha. Hermes Frangoudis (26:25) meet the human-to-human interaction there, right? There's always going to be that little bit of mistake, but the closer we can get, the more lifelike it is, and the more we want to interact, the better the experience, right? Ziyi Lin (26:35) Yeah, and in addition, if we want to detect a turn better, maybe we should use some semantic model to detect the semantic information. For example, you are saying, "Hello, how are you? How is..." and then you stop, and you're thinking about what you're going to ask, and then you say, "how is it going?". When the semantics are finished, the other speaker, the next speaker, can react. Semantic information.
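As a purely hypothetical sketch of the audio-visual-plus-semantic fusion discussed here (nothing shipped in TEN), the turn-taking decision could be written roughly as follows. The thresholds, the silence timer (assumed to be tracked upstream from the fused frame decisions), and the semantic_complete helper are all stand-ins for real components.

```python
"""Hypothetical fusion of audio VAD, visual mouth movement, and a semantic
end-of-turn check (illustrative only; thresholds and helpers are stand-ins)."""


def semantic_complete(partial_transcript: str) -> bool:
    """Naive placeholder for a real semantic turn-detection model."""
    return partial_transcript.strip().endswith((".", "?", "!"))


def agent_may_respond(vad_hears_speech: bool, mouth_moving: bool,
                      silence_ms: int, partial_transcript: str,
                      eos_silence_ms: int = 200) -> bool:
    # Case from the conversation: audio hears silence but the mouth is still
    # moving, so the speaker is likely pausing to think; keep waiting.
    if not vad_hears_speech and mouth_moving:
        return False
    # Mouth moving while the VAD also hears speech: clearly still talking.
    if vad_hears_speech and mouth_moving:
        return False
    # VAD "hearing speech" with a closed mouth is treated as a likely noise
    # false alarm, so it does not block the turn by itself; the turn ends once
    # silence has lasted long enough AND the utterance looks complete.
    return silence_ms >= eos_silence_ms and semantic_complete(partial_transcript)


print(agent_may_respond(False, True, 400, "Hello, how are you?"))   # False: still thinking
print(agent_may_respond(False, False, 400, "Hello, how are you?"))  # True: agent may reply
```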
Hermes Frangoudis (27:16) Yeah, no, the semantics are huge. They play a huge role in it, right? So we've been having a great conversation here and I appreciate your time, but we are nearing the end of our chat. And at the end of all of my conversations on the Convo AI World podcast, I like to ask a bit of a wild card question. If you had a free opportunity to just hack on one moonshot feature for TEN or TEN VAD, the speech elements in TEN, what would you build? It doesn't even have to be from a TEN perspective, right? As a speech engineer, let's say you're not working on TEN, you're not working on anything, and you can choose something really cool that you want to focus on. Just what would that be? Let's say in another life, you aren't working on TEN, you're working on whatever you can dream up. What would that be? Ziyi Lin (28:30) I would like to build a model that can detect the emotion of the speaker, that can perfectly detect the emotion of the speaker. Because if you just use the transcribed speech from the STT system, it is just text, only text, no emotion information at all. You will never know if the speaker is angry or happy, or what kind of emotion they have. So emotion detection is, I mean, very important in our AI agents. Otherwise, the agent will talk to you just like an agent, not like a human. Just like a machine. Hermes Frangoudis (29:16) Yeah, I think that's actually a really interesting point you bring up. That emotional context, that emotional awareness. I think I've maybe seen one or two demos where they get that right, and it's uncanny, right? They understand, and they inject the emotion back into the AI, and it has the awkward "ums" and "uhs", and it feels human. I don't know how to explain it; that emotional connection takes it to another level. Ziyi Lin (29:44) Yeah, you're right. Hermes Frangoudis (29:53) It changes the interaction completely. Ziyi Lin (29:55) Because right now the AI agent is just talking to you, but it is difficult for it to understand your emotion and give you the appropriate response. If you are sad, the AI agent should feel your, I don't know how to say it, your sad emotion. It should be able to empathize and say something to let you calm down. Hermes Frangoudis (30:23) Yeah, it should be able to empathize, right? Yeah. And now, in the TTS systems I've seen, you can inject emotional stylings. So that becomes even more important, right? Understanding my emotion to be able to continue that chain with its emotion, or its perceived emotion. Ziyi Lin (30:36) Yeah, but there are also many problems with the TTS system, because if we use a large language model, a large model, in the TTS system, the system will behave in an unstable way, not very stable. Hermes Frangoudis (31:03) What do you mean by that? Ziyi Lin (31:04) The TTS system will say something that you don't expect. For example, you want the TTS system to say, "Hello, how are you today?", and instead the TTS system responds with, "what the ****?". Hermes Frangoudis (31:20) What? Ziyi Lin (31:23) So the LLM is not very stable, I just want to point that out. Or it uses another emotion, or just speaks with a very strange sound, the pitch is very high. Hermes Frangoudis (31:29) Well, that makes sense. I love that example. Ziyi Lin (31:40) A normal person's pitch is about 100 to 200 hertz. Maybe the speech generated by the TTS system has a pitch of about 300 or 500 hertz, and it will sound very unnatural and very strange. Hermes Frangoudis (32:03) I got you. No, that makes sense. That little difference, that nuance, is something we can pick up on, and it's what really differentiates between something that sounds realistic and something that sounds awkward and weird. It's funny, because over the years, and by years I mean the hundreds of thousands of years that humans have evolved their senses, right, eyes, sight, hearing, we've developed the ability to differentiate between what is another human's voice, what is a sound in your surroundings, what is a normal voice, right?
Like what is an actual person versus a machine? We've become very sensitive to that. Having systems and agents that can really do that right and execute correctly is important, because otherwise it just leads to a terrible user experience. Well, Ziyi, this has been such an amazing conversation. I really appreciate your time on this. I'm so excited for all the new stuff coming out with TEN. For anyone that hasn't tried the VAD, the open source one, we're going to have the link to the Hugging Face page. I really enjoyed this conversation. I learned so much today, I'm not going to lie. I thought I was decently versed in some of this, and I learned I have so much more to go. But I appreciate you peeling back those top layers and giving us a lot of insight into what drives a great VAD model. Thanks, man. Ziyi Lin (33:38) Thank you. Thank you for having me.