Hermes Frangoudis (00:07) Hello everyone and welcome to the Convo AI World Podcast. I'm your host, Hermes Frangoudis, Director of Developer Relations here at Agora. Joining me this week is Ben Weekes, Principal Architect. This is the first episode in a series that's going to be exploring everything voice-first conversational AI through honest conversations with practitioners, builders, infrastructure engineers, and product strategists: people from the trenches providing actual insights on what it takes to build best-in-class conversational AI experiences. So whether you're a developer, a conversational AI enthusiast, or a business person trying to understand this world and navigate it, you've come to the right place. Without further ado, we'll jump into our first episode: What is Conversational AI? And thanks again, Ben, for joining me this week.

Ben Weekes (00:54) Thanks, Hermes. Thanks for inviting me on to be your first guest in episode one, and wish you all the best with this series. It's very exciting. I've got the TEN t-shirt on, which is Agora's open source framework for, well, it stands for Transformative Extensions Network, but it's for joining lots of different AI endpoints together in the highest possible quality. And we've been colleagues together at Agora for quite a few years now.

Hermes Frangoudis (01:16) Yeah, it's great to be doing this with you, Ben. We always have a good time, great conversation. So I'm excited for today's show. And I think to get it started off really simply for the viewers at home, for everyone joining us, let's define what we mean by voice-first conversational AI and how that's really different from chat-based systems. So Ben, maybe in your words, if you want to take this one.

Ben Weekes (01:38) Yeah. So the way I look at it is that, very much like talking to a human, you can talk to AI in three different modes. You've got chat, you've got voice, and you've got video or pictures.
And while text and chat these days are probably the most prolific and voluminous, when you get into voice, that's where you have your more intimate experiences. And if you're ever in a video call, it's the voice that's the most important thing. If my voice starts to go in the video call, everyone's going to lose interest and start reaching for the door. But if the video freezes for a little bit or the text is a little bit late, it doesn't really matter. As long as you've got continuous audio, and our ears are really sensitive to that, then we're connected.

Hermes Frangoudis (02:21) Yeah, the brain kind of picks it up. It'll let you drop a frame or let it freeze, but the audio interruption really kind of kills the flow, right? And that's where voice first really matters.

Ben Weekes (02:27) Yes, exactly. And in the old days on analog phones, they used to play something called comfort noise, so that you could actually feel that you were there even in silences, to kind of create that feeling of connectivity.

Hermes Frangoudis (02:41) That's really interesting. So getting back to voice-first AI, you brought up an interesting topic around comfort noises and what they used to do to improve the experience. Right now, what are some of the real-world examples of voice-based agents you've seen? I remember you telling me something about the last time you went out to dinner.

Ben Weekes (03:01) Yeah, so in terms of pure voice agents, these days, if you try and book a restaurant in the UK (I'm not sure what it's like in the US, but probably similar), you're always going to get through to a voice AI agent, and they actually sound really human. It takes you a while to realize, because they introduce themselves with a name and everything, and they've got very, you know, lifelike, emotional voices. Compare that to the old days of ringing up your bank to try and do something over the phone, where most of the time it didn't understand you.
Even if it did, it would just put you through a sort of menu, a flow-chart IVR. And that wouldn't really, yeah, wouldn't really save you.

Hermes Frangoudis (03:34) Super clunky, right? You've got to figure out which basket you fit in.

Ben Weekes (03:37) Exactly. If anything, it was just trying to get you to go online. That was the whole purpose of it. But now you really can just phone up. And because you're on your phone, it knows who you are. The identity of your phone number allows the system to recognize you from making your booking before, so you can modify your booking. So yeah, certainly in the restaurant area we're seeing a lot of pure voice AI services, but any sort of call center that you might be calling up is definitely going to be benefiting from the recent advances in all the different parts of the conversational voice pipeline.

Hermes Frangoudis (04:10) Well, I'm so glad you brought up pipeline, because this is really an opportunity where we can define the tech stack. Let's break it down. What is the first piece of it, right? Like the speech-to-text, what's happening there? What is ASR, and noise cancellation, voice activity detection, the VAD? Let's dive into a little bit of that.

Ben Weekes (04:26) Yeah. So the three main components of a conversational voice pipeline are, first, the speech-to-text, sometimes called ASR. They're sort of interchangeable, but they mean slightly different things. That's where the computer gets to understand what you're saying, and that gets turned into text. The second part is then being able to take that text and come back with an answer. We've all played with ChatGPT online; you know what it's like. You send it some text, it comes back pretty quickly with a pretty decent answer, and you can ask it to keep that limited to a certain number of words. And then those words can get turned back into voice again and spoken back to you.
And that's what's known as a cascading pipeline. That's one sort of field of this. The other area is what gets called a real-time pipeline. They're both actually...

Hermes Frangoudis (05:12) So is that the voice-to-voice pipeline?

Ben Weekes (05:14) Yeah. OpenAI coined the phrase real-time, even though the cascading pipeline I just described is actually just as fast, so it's also real-time. But real-time as coined by OpenAI, and also Google Gemini, takes voice directly into the LLM and sends voice back out again, the idea being that it's even faster. But even though the speeds are similar, it does have other advantages, in as much as the LLM can actually hear the emotion in your voice. It can comment on your pronunciation of certain words in a language-tuition setup. And it's able to shape the tone of the voice output to match the sentiment of what it's saying a little more strongly.

Hermes Frangoudis (05:58) Interesting. Even though it's a bit of a black-box system, it has its advantages in taking the audio in first and dealing directly with it, versus the cascading that's more...

Ben Weekes (06:05) Yeah, in theory. But even with cascading, when the speech-to-text happens, some speech-to-text engines will put in metadata to describe how the person spoke it. Like, was it sad? Was it energetic? Those types of things. And also, when you send the text from the LLM output to the text-to-speech engine, you can include markup, speech markup, to tell it which bits should be pronounced in certain ways, and again control the emotion in the voice that comes out at the end. So actually they're pretty similar. And even though no one really knows exactly unless you work at OpenAI or Google Gemini, I think the implementation behind the scenes of their real-time voice-to-voice models is built around a cascaded model anyway.
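The cascading pipeline Ben describes (speech-to-text, then LLM, then text-to-speech) can be sketched roughly as follows. The three stage functions here are hypothetical stand-ins for real provider SDK calls, not any specific vendor's API; only the wiring between the stages is the point:

```python
# Minimal sketch of a cascading voice pipeline: STT -> LLM -> TTS.
# Each stage function below is a placeholder for a real provider call.

def speech_to_text(audio_chunk: bytes) -> str:
    """Stand-in for an ASR/STT provider call."""
    return "what watches do you have in stock?"

def llm_reply(history: list[dict], user_text: str) -> str:
    """Stand-in for a chat-completion call; history carries the context."""
    history.append({"role": "user", "content": user_text})
    reply = f"You asked: '{user_text}'. Here is an answer."
    history.append({"role": "assistant", "content": reply})
    return reply

def text_to_speech(text: str) -> bytes:
    """Stand-in for a TTS provider call returning audio bytes."""
    return text.encode("utf-8")

def cascade_turn(history: list[dict], audio_in: bytes) -> bytes:
    text_in = speech_to_text(audio_in)      # 1. transcribe the user
    text_out = llm_reply(history, text_in)  # 2. generate a text response
    return text_to_speech(text_out)         # 3. synthesize the spoken reply

history: list[dict] = [{"role": "system", "content": "You are a helpful shop assistant."}]
audio_out = cascade_turn(history, b"<user audio>")
print(len(history))  # 3: system + user + assistant
```

One property worth noting in this shape: because every stage passes through text, the transcript is guaranteed to match what was spoken, which is the transparency advantage over voice-to-voice that comes up later in the conversation.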
Although interestingly, when you look at the text coming out of OpenAI along with the voice, they often mismatch. So I might hear "Hi Ben, how are you?" but in the text it's saying "Dan, how are you?" So they're not perfectly in sync. It's like they're using two different systems to create the effect of a pure multimodal LLM. And by multimodal, we mean capable of speaking text, voice, and video all at the same time.

Hermes Frangoudis (07:17) So the text is more of an approximation of what it said, versus a word-for-word transcript.

Ben Weekes (07:20) Yeah, they're basically applying speech-to-text on what goes in and out of the LLM separately to what the voice LLM is actually doing. And they're not perfectly in sync, which is a bit misleading at times. At least with a cascading pipeline, you know that what you see in the text is 100% going to match what was said: what it thinks you said and what it's trying to say back.

Hermes Frangoudis (07:29) Very interesting. Yeah, I think that transparency is pretty key though, right? Because the black-box method has some of those limitations. You don't get the same access to that information for auditing purposes and so on. But you also have kind of limited control, right? You have to use whatever function tools they have put in place, versus cascading, where I feel like you could put in your own.

Ben Weekes (08:05) Yeah, so with cascading, you can mix and match your providers. Like, my hottest pipeline right now for the language learning use case is to use ElevenLabs for speech-to-text (even though they're known for text-to-speech, their new speech-to-text is really good for multilingual), then into Groq for your LLM, probably the Llama 70-billion model, which is super fast. You know, it's on specialist hardware, so it replies really quickly.
And then into OpenAI's text-to-speech, which is able to mix languages and has really good emotion built in, different accents, dialects. It's really quite good at telling jokes, it can whisper, it can do that type of thing.

Hermes Frangoudis (08:42) That's a pretty interesting pipeline, right? Completely from left field in terms of what each of those providers is really known for. You would think OpenAI would be maybe more the LLM of choice and...

Ben Weekes (08:44) Yeah, the other one.

Hermes Frangoudis (08:55) ...ElevenLabs for the TTS.

Ben Weekes (08:55) Exactly. It's almost back to front, isn't it?

Hermes Frangoudis (08:57) Yeah, that's super cool though. I'm going to have to try that out for my next project.

Ben Weekes (08:57) It's interesting. Yeah, that's my language-learning pipeline. For my companion one, sort of semi-dating, where you maybe don't want the LLM to be restricted from, you know, swearing or talking a bit more in an adult fashion, you can use xAI for the LLM side of it. That's really quite good at providing conversation. And then you can get different voices, depending on if you want to clone someone's...

Hermes Frangoudis (09:25) Like using your own voice or having a specific speaker, right?

Ben Weekes (09:27) Yeah, if you're making an avatar of someone famous, perhaps you might want to clone their voice and provide the exact look and feel of that person. And then on speech-to-text, I think Deepgram is generally considered pretty fast, and they're getting into lots of different languages right now as well. But one of the beauties of a platform like TEN is we're kind of agnostic to all your different speech-to-text, LLM, and even your avatars, which is the video element of a conversational AI agent. And you can mix and match depending on your needs.
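One common design that makes this mix-and-match possible is keeping each stage behind a small interface, so a pipeline can swap one vendor's speech-to-text for another's without touching the rest. This is a hedged sketch of that idea; the class names and methods are illustrative placeholders, not the actual API of TEN or any other framework:

```python
# Swappable pipeline stages behind minimal interfaces (structural typing).
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

# Toy implementations standing in for real provider adapters.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class UpperLLM:
    def complete(self, prompt: str) -> str:
        return prompt.upper()

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

class VoicePipeline:
    """Any STT/LLM/TTS combination can be plugged in here."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio: bytes) -> bytes:
        return self.tts.synthesize(self.llm.complete(self.stt.transcribe(audio)))

pipe = VoicePipeline(EchoSTT(), UpperLLM(), BytesTTS())
print(pipe.turn(b"hello"))  # b'HELLO'
```

Replacing `UpperLLM` with an adapter around a different provider changes nothing else in the pipeline, which is the agnosticism Ben attributes to platforms like TEN.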
And different verticals and different regions of the world tend to have different advantages with different providers, depending on what you're building.

Hermes Frangoudis (10:05) Before we get too far into the infrastructure, I want to back it up two steps, because you brought up a very interesting topic that I do want to touch upon, and that's virtual avatars. These are the other dimension, right? We spoke about text and voice, and now the visual element of the conversation, which can give you a better understanding of the context.

Ben Weekes (10:24) Definitely, yeah. And there was the whole uncanny-valley thing a few years ago, but various avatar providers are getting really good now with video-based avatars, which look just as good as real humans. And that is the video element of this. So in certain use cases, again like learning a language, it's really important to see the mouth when you're getting used to sounds. Or if you're having a companionship or maybe mental health conversation, you're more likely to open up if you've got that face. The visual contact, and all the other emotions that get triggered in your brain by having someone else's face in front of you in addition to the audio, is really important.

Hermes Frangoudis (11:01) It adds another layer to the emotional connection, right? Being able to visualize it, not just hear it. That multi-sense experience.

Ben Weekes (11:04) Yeah. If I couldn't see you now, that would be less than half as good. I mean, the voice is important, as we said, but the video is the cherry on the cake.

Hermes Frangoudis (11:14) That's a great way to put it. Now let's take that jump forward into more of the infrastructure side of it, right?

Ben Weekes (11:22) And maybe I should just quickly touch on some of the different technologies involved in providing really high-quality conversational voice, such as your AI for voice activity detection.
Because it's really important, with that cascading model and even with voice-to-voice, to know when the agent, the AI, should reply. You know, is it when I've stopped talking for 300 milliseconds? Is it when I've actually said something which implies that it should reply?

Hermes Frangoudis (11:29) Yes.

Ben Weekes (11:47) Or maybe I'm just thinking about something and it should allow me the time to respond.

Hermes Frangoudis (11:53) So that's more than just VAD, that's turn detection as well, right? Yeah.

Ben Weekes (11:56) Exactly, it's both of those wrapped into one, using a lot of AI to predict when the right time to reply is.

Hermes Frangoudis (12:04) And I think TEN just released some stuff for that, right? They just released a new open source model on turn detection and VAD.

Ben Weekes (12:10) Yeah, exactly. And without that, it's not a very good experience. Like, if the agent replies too quickly, or if I've got my radio or TV on and that triggers a response, then that breaks down quite quickly in terms of a natural conversation. Actually, trying to talk to an AI agent in a public place, this time last year, was almost impossible. It's getting better, and that's a big area of research at the moment, making it work well.

Hermes Frangoudis (12:39) So in terms of the infrastructure side of things, what would you say are the biggest pain points in developing a Convo AI voice agent at scale?

Ben Weekes (12:48) Yeah. Typically you're going to have a server running somewhere, but you want your users to be anywhere in the world, and being able to transmit that voice reliably and quickly, with the lowest amount of latency or lag, is a real, massive challenge. And again, that's where Agora steps in with its overlay network. Agora has been doing voice and video between real humans for like 10 years now.
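The 300-millisecond silence rule Ben mentioned is the simplest baseline for turn detection: an energy threshold decides which frames are silence, and a timeout decides when the turn has ended. This is a toy sketch of that baseline only; the constants are illustrative, and production systems (like the turn-detection models discussed above) layer learned models on top of this:

```python
# Toy energy-threshold VAD with a silence timeout for end-of-turn detection.
FRAME_MS = 20           # duration of each audio frame
SILENCE_MS = 300        # how long a pause counts as end-of-turn
ENERGY_THRESHOLD = 0.1  # below this, a frame is treated as silence

def end_of_turn(frame_energies: list[float]) -> bool:
    """Return True once trailing silence reaches SILENCE_MS."""
    silent_frames = 0
    for energy in reversed(frame_energies):
        if energy < ENERGY_THRESHOLD:
            silent_frames += 1
        else:
            break  # stop at the most recent speech frame
    return silent_frames * FRAME_MS >= SILENCE_MS

speech_then_pause = [0.8] * 10 + [0.02] * 15  # 200 ms speech, then 300 ms silence
print(end_of_turn(speech_then_pause))          # True: the pause is long enough
print(end_of_turn([0.8] * 10 + [0.02] * 5))    # False: only 100 ms of silence
```

The failure modes Ben lists fall straight out of this baseline: a thinking pause longer than the timeout triggers a premature reply, and a TV in the background keeps the energy above threshold, which is why semantic turn-detection models are an active research area.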
So being able to take all of those skills, as well as the...

Hermes Frangoudis (13:14) At a global scale, right? It doesn't matter whether you're next to each other or hundreds of thousands of miles away, you're getting the same low-latency streams everywhere.

Ben Weekes (13:16) Yeah. And the algorithms for echo cancellation and noise reduction, they're all completely applicable to this as well.

Hermes Frangoudis (13:29) Yeah, because at the end of the day, the AI has to hear what you say, right? If you lose a packet, that's probably some of the most important packets. If it's person to person, it's a different thing; my brain can piece together kind of what you're saying. But if the LLM doesn't get it, it really messes up the message and goes in a completely wrong direction.

Ben Weekes (13:42) Yeah. In fact, nearly all AI service providers, so all your speech-to-text, your LLMs, your text-to-speech, they're all using WebSockets, which is a TCP-based technology. So the further away you move from America, effectively, where they're all being hosted, the bigger the latency you're going to have, and the more packet loss you're going to have as you get further away on the public internet. So having you...

Hermes Frangoudis (14:06) Yeah, because WebSockets are really made for that last mile, right? You have to have a really strong connection. Under optimal conditions WebSockets are great, but it's real life. Let's be real.

Ben Weekes (14:15) Yeah. That's where UDP, which is basically sending packets without waiting for them to be acknowledged, and retransmitting them yourself if they become late or lost, is much better for video calling. And again, that's kind of Agora's bread and butter from doing, you know, video conferencing, voice-to-voice and video-to-video, between real humans.
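The TCP-versus-UDP trade-off Ben describes comes down to head-of-line blocking: TCP stalls everything behind a lost packet until it is retransmitted, while a UDP media receiver can simply skip stale audio and keep playing. This toy illustration shows the skip-and-continue behavior with plain sequence numbers (no real networking involved):

```python
# UDP-style playout: prefer freshness over completeness.
# Packets arrive as (sequence_number, payload); anything older than what
# has already been played is dropped rather than waited for.

def play_out(received: list[tuple[int, str]], next_seq: int = 0) -> list[str]:
    """Play packets in arrival order, skipping gaps instead of stalling.

    A TCP-style stream would instead block on the missing sequence number,
    delaying every later packet behind it (head-of-line blocking)."""
    played = []
    for seq, payload in received:
        if seq < next_seq:
            continue          # late packet: too old to be useful, drop it
        next_seq = seq + 1    # jump forward past any gap
        played.append(payload)
    return played

# Packet 1 is delayed and arrives after packet 2 has already played.
arrivals = [(0, "he"), (2, "lo"), (1, "l")]
print(play_out(arrivals))  # ['he', 'lo']: packet 1 is dropped, audio keeps flowing
```

Real media stacks refine this with jitter buffers, forward error correction, and selective retransmission from a nearby point of presence, as discussed next, but the underlying choice of dropping stale data rather than blocking on it is what makes UDP the fit for live voice.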
Hermes Frangoudis (14:37) And it's interesting because Agora supports up to what, like 60% packet loss while still being able to put through a signal that the person on the other side can understand, right? So in terms of the encoding, maybe it'll drop from stereo to mono.

Ben Weekes (14:52) And in a two-person human call, where I might be here and you could be in India, and you've got packet loss on your downlink, typically every other provider that I know would get the publisher to resend those packets. Whereas with Agora having points of presence all around the world, you can get them resent just from your India access point, without me in London having to resend them again. A lot of other networks tend to use a single node or conferencing server, so we'd be having a conferencing server in either London or India, and one of us would have to be requesting from a long way away. Even Google Meet uses a single conferencing server somewhere in the world for joining all the people together, rather than multiple points of presence.

Hermes Frangoudis (15:23) Or Santa Clara. That's interesting. So it's using a single, almost like an SFO of sorts. SFU. SFU. Yeah.

Ben Weekes (15:38) SFU, yes. Selective Forwarding Unit.

Hermes Frangoudis (15:43) I'm flying out of SFO later tonight. So yeah, I think that's a great breakdown of how the infrastructure side of it really matters. And in terms of frameworks, right, we're talking infrastructure frameworks, there have got to be some big hitters. We talked about TEN. There's a couple of others out there. Are there any that really move the needle, and which ones are really there for production and ready to go?

Ben Weekes (16:08) I think the big ones are LiveKit, which has an open source LiveKit Agents framework and LiveKit Cloud. And then you've got Pipecat, which is kind of spun out of Daily, mainly using Daily's network for the voice and video. And then TEN, which is Agora.
And it's no sort of surprise or shock, really, that all three of those have come out of traditional online video conferencing technology, because it's that UDP framework which really helps. I think the big difference when you look at those platforms is that both LiveKit and Pipecat are Python through and through, quite easy to learn and get up to speed with, whereas TEN is...

Hermes Frangoudis (16:48) But that's a lot of overhead, isn't it? In terms of operational costs, you're going to have to run more powerful machines.

Ben Weekes (16:53) Yeah, from a runtime performance perspective, that's true. But in terms of developers getting up to speed quickly with some examples, it's quite straightforward. Whereas the TEN framework from Agora is written in C. It's a bit of a steeper learning curve, but the end result gives you much more powerful, performant agents running on your servers. So it's more scalable, and I think Agora is all about scalability.

Hermes Frangoudis (17:19) Yeah, scalability without compromise on flexibility, right?

Ben Weekes (17:23) Yeah. And even though there is a bit of a learning curve and not many people know C that well, it's less of a problem these days with AI helping you to write code. You can just ask it to create your TEN extension: ask Claude or whatever, show it the docs and the template code. You did some of that, didn't you? And then it will...

Hermes Frangoudis (17:38) Yeah, I made a little CLI tool where you could describe the plugin that you wanted and it would format it correctly and everything. But I think the TEN team has also been doing a lot with their designer tool, that drag and drop, being able to define the plugin, yeah, the graph tool.

Ben Weekes (17:41) The graph tool, yeah. I'm not sure about that personally.
Hermes Frangoudis (17:58) I saw a nice little sneak preview of what's coming. It's going to be something cool. So those are great frameworks. The TEN framework is open source, right? And you talked about how Pipecat has Pipecat Cloud, and LiveKit has LiveKit Agents on the hosted side of it. TEN is the base for Agora's Conversational AI, right? And our Conversational AI Engine has that best-in-class VAD and noise cancellation, and allows you to bring your own LLM and TTS to the equation. So on that cascading model in Convo AI, I want to pick your brain on where you see a lot of the strength in that, and how that's a better model to approach than maybe some of the other hosted frameworks.

Ben Weekes (18:25) Let me see. So you mean cascading versus real-time, or TEN's versus...?

Hermes Frangoudis (18:45) No, like our version of cascading in the Convo AI Engine, and really the flexibility that it gives developers in their ability to launch conversational AI agents with that best-in-class SD-RTN network.

Ben Weekes (18:56) Yeah. So I think the main sort of advantages would be that the extensions will ultimately be running in a C environment, so they'll be faster when you publish them into Convo AI Cloud. And then you've got Agora's SD-RTN real-time network that we spoke about, which is multiple points of presence rather than everyone connecting to a single point for a server. And then the AI for interruptions, voice activity detection, echo cancellation, and noise reduction: all of those are best in class. And I think a lot of the other services are generally WebRTC wrappers, so they're all basically using the same kind of RTC technology. And while Agora does use WebRTC in the web browser, we also have native SDKs for every platform as well. And our technology for actually calculating the bitrate, the available bandwidth between the server and the web browser, is custom to us. We call it VOS Web internally, but it's a much more advanced way of detecting whether it's line packet loss or congestion packet loss on the downlink.
And then knowing how to deal with that, either by resending packets or by reducing the bitrate. And with voice, you know, some of this stuff isn't quite as important, because voice is low bitrate and you have forward error correction, where you just retransmit the packets anyway; they're little.

Hermes Frangoudis (20:05) Fair.

Ben Weekes (20:15) But when you get into video with avatars, it's much more powerful to have the Agora architecture, where you can accurately know what the downlink bandwidth is and address the downlink connection by either increasing the bitrate or reducing it, depending on whether it's line or congestion packet loss. And we've also got access to advanced AV1 encoders, which are more efficient than the standard ones.

Hermes Frangoudis (20:39) Yeah, they're like what, 30% more efficient or something massive like that?

Ben Weekes (20:42) Yeah, same with our H.264 as well. We've really optimized the encoder so we can do that at higher scale; we can do 1080p, 30 frames per second, at 30% less than the benchmarks. So again, when you're building a proof-of-concept science project, maybe you would find it quicker to do it in a pure Python thing taken off the shelf. But if you're thinking about delivering a service to thousands of customers concurrently, then you really need to be thinking about performance.

Hermes Frangoudis (21:12) Performance and the scalability of it. Speaking of delivering to customers, something that you really need to think about is the idea of function calling and tools and RAG. So let's demystify these things, because I think you have a really cool perspective and would love to have people hear it.

Ben Weekes (21:28) Yeah, okay. So we all know that when you send a prompt into ChatGPT, that's a big load of text, and you'll get back a response.
And RAG, which stands for Retrieval-Augmented Generation, is basically the ability to pull some information from a database, just a regular database or a vector database, to give the LLM more information about that topic. So let's say that I was an end user asking something about, I don't know, studying to become a hairdresser, something that maybe my LLM didn't know much about, like the exams around it, because it hadn't been trained on that. It would be possible to pull information from a specialist database about that and include it in the prompt. So when I'm asking a question about what I need to prepare for my hairdressing finals, in goes the syllabus along with my question, and the LLM can then provide a more accurate answer than just relying on whatever it was trained on two years ago. So yeah, RAG, for all it is, is just providing extra information in that prompt. And then function calling is the ability for the LLM to trigger an event. And you can do that quite easily, in lots of different ways. Like, you could just say to the LLM, if you want your avatar to shake its head, include, in angle brackets, head shake. And then instead of playing that out with the TTS, the client would just send a message to the avatar to get it to shake its head.

Hermes Frangoudis (22:50) So you would extract that extra metadata out and then process it from there.

Ben Weekes (22:53) Yeah. But by using functions, or tools, it's a much more standard way of doing that, such that it doesn't get mixed up between what should be spoken and what should be acted upon. Even though tools are fairly new, they don't work very well in anything that's been around for longer than a year or so, like all the old Llama stuff; most of the LLMs out there aren't completely standard and compliant with the latest tools specification. OpenAI is pretty good. DeepSeek seems to be pretty good for most of the models.
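Mechanically, the RAG step Ben describes is just prompt assembly: retrieve relevant text, then prepend it to the question before it goes to the LLM. Here is a minimal sketch using his hairdressing example, with a toy keyword retriever standing in for what would normally be a vector-similarity search over a real document store:

```python
# Toy RAG: retrieve matching documents and splice them into the prompt.
DOCS = {
    "hairdressing": "The hairdressing finals cover cutting, colouring, and salon hygiene.",
    "watches": "The Aquanaut 5167A has a 40.8mm steel case.",
}

def retrieve(question: str) -> list[str]:
    """Toy retriever: keyword match instead of vector similarity search."""
    return [text for key, text in DOCS.items() if key in question.lower()]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the LLM answers from it, not just training data."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What do I need to prepare for my hairdressing finals?")
print("salon hygiene" in prompt)  # True: the syllabus rides along with the question
```

The assembled string is what actually gets sent as the prompt, which is why Ben can say that RAG, for all it is, is just providing extra information in that prompt.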
But again, you can work around some of the models being behind by just asking them to include keywords in the output instead. With a voice-to-voice model, though, the real-time ones, you really need to use functions, because you don't want the text coming out to contain anything other than what should be spoken. That's how they work; they're voice-to-voice.

Hermes Frangoudis (23:49) Speaking of voice-to-voice, it reminds me of something we were talking about the other day. You brought up a very interesting topic around how the longer you talk to the voice-to-voice model, kind of the worse it gets, right?

Ben Weekes (23:59) Well, that's a whole new subject. The pricing, right? The pricing of cascading versus voice-to-voice. Voice-to-voice is like 20 times more expensive to begin with. But as you progress in a cascading conversation, every request into the LLM has to contain the history of the whole conversation. So after a few minutes, you're sending in a massive context window, which is the history of the conversation.

Hermes Frangoudis (24:25) For the voice-to-voice, you mean? Or with cascading?

Ben Weekes (24:27) With cascading, yeah. Cascading works with a synchronous request into an LLM and a response out. That request will contain your RAG, it'll contain your function definitions or instructions to include certain things, and it will include the question, or...

Hermes Frangoudis (24:43) So...

Ben Weekes (24:43) Yeah.

Hermes Frangoudis (24:43) ...your message history and then your prompt, right?

Ben Weekes (24:46) Every time you continue the conversation, the whole history of that conversation has to be included. Which means that, because you're paying per token, as the conversation goes on it's getting exponentially more expensive with cascading.

Hermes Frangoudis (24:58) But you can do stuff to reduce that, right?
Like you could summarize some of the history and...

Ben Weekes (25:03) Yeah, of course, you can do that type of thing. But if you want to have lots of detail in your history, if you're playing a game or doing some sort of role play, it's quite hard to summarize it without losing information. Whereas with the voice-to-voice model, it's expensive to make each request, but it's a fixed price. You don't have to keep sending the history, because they're effectively maintaining the session on the server, which is probably why it's more expensive per request. But it doesn't grow exponentially in cost as the conversation progresses.

Hermes Frangoudis (25:34) But doesn't the voice-to-voice kind of lose some of that original prompt and...

Ben Weekes (25:39) Yeah, it certainly does. I mean, if you build a language tutor with Google Gemini commenting on your accent and your pronunciation, for the first few sentences it sticks to the program. After a few sentences, it tends to start talking about something else entirely, or stops commenting on the words that you're mispronouncing. It loses the thread of the conversation quite quickly. And you're right, you can't keep updating the prompt. You only update the prompt at the beginning of a session with a voice-to-voice model.

Hermes Frangoudis (26:11) Versus the cascading, where you can adjust it every time it comes in, right?

Ben Weekes (26:13) Every time, yeah. I read something recently that you get much more accuracy if you always put the actual prompt at the end of the history rather than the beginning. I mean, simple, not so much obvious, but basic trial-and-error things that people have learned about LLMs and how to get them to be accurate over longer periods of conversation.

Hermes Frangoudis (26:32) My brain is full of amazing little nuggets that you've passed along here.
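The cost growth Ben describes comes from resending the whole history on every turn, and the mitigation Hermes raises, summarizing older turns, can be sketched as a cap that collapses everything but the system prompt and the last few messages. This is a hedged sketch: the `summarize` function here is a placeholder, where a real implementation would ask the LLM itself to write the summary (with exactly the detail-loss trade-off Ben points out):

```python
# Cap context growth by collapsing old turns into a summary message.

def summarize(messages: list[dict]) -> str:
    """Placeholder: a real version would call the LLM to summarize these turns."""
    return f"[summary of {len(messages)} earlier messages]"

def trim_history(history: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep the system prompt and the last few turns; summarize the rest."""
    system, rest = history[0], history[1:]
    if len(rest) <= keep_last:
        return history  # short conversation: nothing to collapse
    old, recent = rest[:-keep_last], rest[-keep_last:]
    return [system, {"role": "system", "content": summarize(old)}] + recent

history = [{"role": "system", "content": "You are a tutor."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(history), len(trimmed))  # 21 6: per-request token count stays bounded
```

With a cap like this, the per-request token count stays roughly constant instead of growing with conversation length, at the cost of losing the fine detail in the summarized portion.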
I don't know if you want to go into maybe a demo? I think you've created something pretty cool, so we should probably show that off.

Ben Weekes (26:39) Yeah, I've got a little demo. Is that visible? So in this demo, we're going to have voice, video, and chat modes. Here we go. And this is using an avatar to sell a watch. It could be on a live shopping platform.

Hermes Frangoudis (26:45) Yeah.

Ben Weekes (26:58) Welcome to my store, how are you all doing? Let's get started here. This Patek Philippe Aquanaut 5167A is priced at $22,800. You're getting a real deal at 8% under retail, with its exceptional features and potential value. Is it stolen? Rest assured, it is not stolen. The package comes complete with its certificate, authentication report, and all original documents, as well as boxes confirming authenticity. The Patek Philippe Aquanaut 5167A features a striking black checkerboard dial accentuated by the sleek steel case. The package also includes the signature black tropical rubber strap. Does it have any scratches? The 40.8 millimeter steel case has only faint hairlines, adding character but not detracting from its beauty. The crystal remains flawless, ensuring clarity and elegance when you wear this timeless piece. This Patek Philippe Aquanaut 5167A, priced attractively at $22,800. I'd like to buy one. How soon can you ship it? Great choice. Once payment clears, we'll arrange insured overnight shipping, ensuring it reaches you swiftly. The package includes a three-day inspection period, giving you peace of mind. Yes, please. Let me buy it and ship it.

Hermes Frangoudis (28:08) You are...

Ben Weekes (28:08) Fantastic. Once you've completed the regular purchase process and payment is confirmed, we'll ship it overnight. You'll soon have this exceptional Patek Philippe Aquanaut on your wrist. Secure the deal now and enjoy.

Hermes Frangoudis (28:11) You're going to enjoy that watch, I think.

Ben Weekes (28:21) You're welcome.
If you have any other questions, feel free to ask. Take care. If you ever need help, I'm here. Goodbye.

And you know, there was a pause there. I didn't say anything on purpose, but the LLM actually continued. Because usually, when you think about an LLM, it's one request, one response. And actually, that's what happens here. But when the state changed from talking to not talking, and the other person, the playing character as it gets called, didn't say anything in that scenario, then it triggered another message to get the seller to continue selling, effectively. So it's this idea that you can have a continuous live shopping channel, with multiple people talking or typing. I mean, I allowed it to be interrupted there, and perhaps you wouldn't want that; you'd always want the agent to complete what it's saying. But it just showed the multimodal side, the ability to interrupt. And yeah, people will be able to sell to lots of people simultaneously, rather than just one person at a time.

Hermes Frangoudis (29:15) That was super cool. I think the ability to interrupt it, both with speech and text, was very interesting. It has that multimodal quality. And that's the cascading model: you have full control over the pipeline and how that data is moving through.

Ben Weekes (29:27) Yep, exactly right. Because with voice-to-voice, the thing just doesn't like to be left quiet. It'll just start talking on its own, so you can't keep it quiet.

Hermes Frangoudis (29:40) So you're going to have to constantly interrupt it at that point, which is a weird experience. That's kind of everything that I have on my list of things I wanted to talk about for this episode. Ben, I really want to thank you for joining us today, and for everyone watching and following along, thanks.

Ben Weekes (29:55) It's been a great pleasure, Hermes. And if any of your future watchers want to get in touch for help with any projects, I'm always excited to hear what people are building.
Please reach out, and I look forward to seeing what gets built in this space over the next year or so.

Hermes Frangoudis (30:09) For sure. You can reach out through the main Agora accounts on LinkedIn, Twitter, and YouTube, or you can find us directly on LinkedIn and Twitter.

Ben Weekes (30:15) Yeah, LinkedIn's good.

Hermes Frangoudis (30:19) Cool. Thanks again everyone for watching, and thanks Ben for joining. I ask everyone to please like, subscribe, and share this with anyone you think would find value in our conversations, because we're really just trying to bring light into this space where most people are building. There's a lot moving very fast, and we've got to try and share this knowledge base so that everyone can grow together, right? And I want to invite everyone to be on the lookout for our next episodes. We're going to be live streaming them as we record, and then releasing the final edited versions for everyone to consume and share. So with that being said, we'll see you in the next one.