Hermes Frangoudis (00:07) Hey everyone, and welcome to the Convo AI World Podcast, where we interview the developers, founders, and teams building actual products in the world of conversational AI. With us today I have the amazing team from Palabra, the developers of a real-time speech-to-speech translation technology. Thank you so much for joining us today, Artem and Ivan.

Artem Kukharenko (00:31) Thanks for having us here.

Ivan Kuzin (00:35) I'm pleased to be here today.

Hermes Frangoudis (00:36) Cool, so let's get right into it. What inspired Palabra to develop its real-time speech-to-speech translation system?

Artem Kukharenko (00:45) Yeah, we lived as digital nomads in different countries and faced language problems ourselves. On the other hand, we had experience in machine learning and AI. For example, I built real-time computer vision systems before Palabra. And that inspired us to try to build a product for simultaneous real-time speech-to-speech translation.

Hermes Frangoudis (01:12) That's amazing. That sounds awesome.

Artem Kukharenko (01:12) And it's much easier to learn a programming language than a foreign language, as you can hear. So it was always a kind of dream to be able to speak any language.

Hermes Frangoudis (01:20) Hahaha. No, it's awesome. You're using the knowledge you had as a machine learning engineer to actually solve a real problem that you and your team were having. Speaking of the problems you're solving, what are some of the misconceptions people have about real-time speech-to-speech translation, and how does Palabra address them?

Ivan Kuzin (01:48) I'll address this one. The most common misconception is that people think AI is something like a common translator tool, like we used to have in the pre-AI era. But AI changes everything. AI can provide us with tools to convey messages much more clearly than before, and that creates an opportunity to share knowledge seamlessly, effortlessly, and fast, in real time. What we need now is to address this misconception and educate people that this technology breaks barriers. It can work beyond the human level of translation, and that's what is important here. We're standing on the brink of a revolution where people who were not able to speak to each other before can now do it effortlessly, like we're speaking to each other today.

Hermes Frangoudis (02:50) No, that totally makes sense. The ability to have that real-time conversation is really important, right? And I think one of the elements of real-time, of having a truly natural conversation, is latency. So how does Palabra balance the trade-offs between translation latency and accuracy within a live application?

Ivan Kuzin (03:14) It kind of works like magic here. The point is that the system tries to predict the words the speaker will say, and that decreases the latency. We can do this through full-stack control of all the components within the system, which we develop in-house. Still, there are limitations. When you say a word like "hello," it takes around 200 milliseconds to pronounce, and then you need at least around 300 milliseconds for it to be translated, for the system to understand how it links to other words. Also, in different languages you can place the word "hello" at the end of the sentence, due to how the grammar works, but the system needs to understand it and translate it in the correct order without adding latency. That's where we use sentence splitters, data, and prediction algorithms. That's how it works. It helps decrease the latency quite a lot, but we still have those issues. I think the technology will keep advancing to solve them.
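[Editor's note: here is a minimal sketch of the incremental idea Ivan describes, translating stable segments of a growing transcript instead of waiting for the full utterance. The segmentation rule and the stub translator are illustrative assumptions, not Palabra's actual algorithm, which also predicts upcoming words.]

```python
SENTENCE_END = {".", "!", "?"}

def stub_translate(segment: str) -> str:
    """Stand-in for a real translation model."""
    return f"[{segment} -> translated]"

def incremental_translate(partial_transcripts):
    """Consume growing ASR partials; emit a translation as soon as a
    segment looks stable (here: once a sentence boundary appears)."""
    emitted = 0  # characters already handed to the translator
    for text in partial_transcripts:
        pending = text[emitted:]
        cut = max((pending.rfind(p) for p in SENTENCE_END), default=-1)
        if cut >= 0:
            segment = pending[: cut + 1].strip()
            if segment:
                yield stub_translate(segment)
            emitted += cut + 1

# Simulated recognizer output: each partial extends the previous one.
stream = [
    "hello",
    "hello everyone.",
    "hello everyone. nice to",
    "hello everyone. nice to meet you.",
]
for out in incremental_translate(stream):
    print(out)
```

A real simultaneous system lowers latency further by emitting before the boundary even arrives, using prediction of the words still to come, which is the part Ivan calls magic.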
Hermes Frangoudis (04:11) Okay.

Artem Kukharenko (04:22) And from the technical point of view, the two key metrics we always optimize and work on are latency and quality of translation.

Hermes Frangoudis (04:31) So those are top of mind. You mentioned using a bunch of different techniques, like predicting what's coming in and how it fits into the rest of the context, and still hitting some challenges. Can you share or elaborate on the challenges in achieving that human-like voice-to-voice replication and translation?

Artem Kukharenko (04:52) Yeah, so the biggest challenge is latency, and we've done great work optimizing it, but we still have room for improvement, because we want our solution to work with zero latency on every language pair. The second big challenge is voice cloning between languages, because during the last several years people found out how to clone voices inside one language. So it's quite easy to clone a voice from English to English, but it's much more difficult to clone a voice from Chinese to English in real time. And it's especially challenging for languages with different meanings of intonations, because in one language an intonation could mean excitement, and in another language the same intonation means something else. It's a big challenge and still an open research problem.

Hermes Frangoudis (05:52) I was just going to ask how you handle things like speaker emotion and intonation, and really catch that from one language to the other. Because, like you said, in some languages people just speak differently. They speak louder, and that's just how they are culturally, right? It doesn't mean they're feeling a certain emotion or not; that's just how they speak. So how do you work around that and preserve it as you make the translation?

Artem Kukharenko (06:20) Yeah, what we see as very important is to have control over the full pipeline. What people usually do is try to solve the speech-to-speech translation problem with three different third-party APIs, for speech-to-text, text-to-text translation, and text-to-speech. And it doesn't work very well, because the text-to-speech model doesn't have all the information, all the audio features from the speech-to-text model, for example. That's why we decided to build our system the more difficult way: we have to train all the components and build different adapters between them, but it gives us much more control over the whole pipeline.
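[Editor's note: an illustrative sketch of the architectural contrast Artem draws. The class and function names below are hypothetical stubs, not Palabra's components. The point is that a cascade of black-box services passes only text between stages, so prosody is lost, while a jointly trained pipeline also carries audio features forward through learned adapters.]

```python
from dataclasses import dataclass

@dataclass
class SpeechFeatures:
    text: str             # recognized words
    prosody: list[float]  # e.g. a pitch contour the TTS could reuse

# --- stand-in stubs for real models ------------------------------------
def stt(audio: bytes) -> str:
    return "hello everyone"

def stt_with_features(audio: bytes) -> SpeechFeatures:
    return SpeechFeatures(text="hello everyone", prosody=[0.2, 0.9, 0.4])

def mt(text: str) -> str:
    return f"<translated:{text}>"

def prosody_adapter(prosody: list[float]) -> list[float]:
    # In a real system this would be a trained mapping between components.
    return [round(p * 0.8, 2) for p in prosody]

def tts(text: str, style: list[float] | None = None) -> str:
    return f"audio({text}, style={style})"

# Naive cascade: three independent APIs, prosody never reaches the TTS.
print(tts(mt(stt(b"..."))))

# Joint pipeline: audio features flow forward through an adapter.
feats = stt_with_features(b"...")
print(tts(mt(feats.text), style=prosody_adapter(feats.prosody)))
```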
Hermes Frangoudis (07:07) Amazing. My mind is blown. I'm just trying to process it and get an understanding of where, in this multilingual conversion and the code switching, AI plays its role in making that happen.

Artem Kukharenko (07:25) So, in terms of code switching, in some sense it's much easier for an algorithm to switch between different languages than for humans. Imagine you have a conference with 10 different languages. Then you'll need at least 10 different human interpreters, and you'll have a lot of different language pairs. The human interpreters will need to translate via an intermediate language like English: one interpreter translates from Finnish to English, and a second one translates from English to Japanese, for example. But an AI algorithm knows all the different languages at the same time, so it can translate from any language to any language simultaneously. And that simplifies the overall system a lot.

Hermes Frangoudis (08:24) That's awesome, how it can just know everything, right? So it's a seamless translation from one to the other; there's no intermediary. Because I feel like that intermediary-to-intermediary step probably loses meaning, loses a lot of the context and the feeling of it. So having that one-to-one is huge.

Artem Kukharenko (08:43) Yeah, but that said, it's not perfect right now and it makes mistakes. In some cases human interpreters work with much better quality. But yeah, it's still...

Hermes Frangoudis (08:54) But not always available, right? So let's talk a little bit about the industries or applications where you're really seeing your translation solutions make a significant impact.

Ivan Kuzin (09:07) Thanks for your interest, Hermes, I'll address this one. Basically, currently we're working with a lot of events. Just today we had two of them, one in Taiwan. It worked pretty well translating Chinese into English and vice versa. It can also be used in broadcasting. There are still not a lot of applications there, because broadcasters are much less tolerant of mistakes, but we're heading toward that: live stream shows and also dubbing. Social commerce makes a big impact here too, like Whatnot in the US and also TikTok Shop, I guess they call it that. Not speaking about the Chinese market here, because they're huge and they handle it themselves in their own language.

Hermes Frangoudis (09:50) Yeah, huge, huge live commerce players. Yeah.

Ivan Kuzin (10:00) For Whatnot, for example, it could be a big break if a creator starts speaking Spanish. For now they speak only English, and that kind of cuts off half of the audience for them.

Hermes Frangoudis (10:12) Yeah, it expands who their creators' content can be shown to, right?

Ivan Kuzin (10:16) Yeah, reach. It expands the reach. People convey messages, yeah? And everywhere the message is important, where the meaning is important, we are there to help people convey those messages. If you're trying to sell something to someone who doesn't speak your language, that's a use case for us. Currently the most adoptive industry is events, live events, or even better, online events. They already adopt this kind of technology. But I think on a scale of one to two years, everyone will be using translation tools to reach new types of audiences. That makes it global, and that makes it one of the most promising markets out there.

Hermes Frangoudis (10:56) I see the need for it constantly. There are constantly points in my daily life where I'm either on a video call or watching a live stream of some sort, and it's not always in your native language, even English, right? And so this really breaks down the barriers to knowledge sharing. And you mentioned that, yeah, that's huge.

Ivan Kuzin (11:16) And that's what's important. Yeah.
Hermes Frangoudis (11:20) So you mentioned having rolled this out at different virtual events, live events, that sort of thing. What kind of feedback are you getting from people? How is this impacting the end user? Because that's the real magic, right? The impact it has on the person actually getting to use it.

Ivan Kuzin (11:37) Well, as I told you before, it feels like magic now. Imagine you live in the biblical era, when the Tower of Babel is still being built, and you can speak all the languages. And that's the case. I was giving a sales pitch to an LSP in Argentina, and the girls on the other side couldn't speak English well, so they couldn't correctly convey what they wanted to buy. Then I just switched on Palabra, and with the push of a button they changed 180 degrees. They were like different personalities. I could read their personality, they could read mine, even though we were speaking different languages. I even stopped speaking English and switched to Hebrew, and it was fine.

Hermes Frangoudis (12:25) It's like that comfort zone, right? It puts you in a more comfortable spot, gets people more open to engaging. And again, it breaks down those barriers. Like you were saying, they came in not finding it so easy to communicate, right?

Ivan Kuzin (12:37) It does not just translate the message, it transfers the meaning, which is what's important.

Hermes Frangoudis (12:44) That's the huge piece, the meaning, the feeling, right?

Ivan Kuzin (12:46) Yeah, yeah. And the feeling, the context. You never get taken out of context. And that's not just technology; we're speaking about live magic happening right here behind our backs.

Hermes Frangoudis (13:01) Yeah, it's amazing. The technology is moving so fast, it feels like we're living in the future. And in terms of Palabra, you're relatively new to the space, right? There are a lot of big players in the AI translation space. How do you differentiate yourselves? I know you do a lot, and we've spoken, but I'd love to hear it.

Artem Kukharenko (13:22) The key here is focus, and we are focused on one particular problem: simultaneous interpretation. Our key goal is to have a very efficient model that works with very low latency and can still translate meaning and previous context. And that's a big differentiator from the large language models used in ChatGPT or other big services, because they try to build AGI dialogue systems, and a dialogue system has one main principle: it speaks, and then the user speaks. While the user speaks, the dialogue system listens. And we don't need that. We need a system that can speak simultaneously with the user, and can do it in 10 or 20 different languages at the same time. So we focus on one particular, but very important, task.
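[Editor's note: a toy sketch of the turn-taking contrast Artem makes. A chat-style dialogue system is half-duplex: listen, then speak. A simultaneous interpreter must be full-duplex: keep listening while speaking. The function bodies are illustrative stubs, not anyone's real system.]

```python
import asyncio

async def listen() -> str:
    await asyncio.sleep(0.1)   # stand-in for capturing a mic frame
    return "audio frame"

async def speak(text: str) -> None:
    await asyncio.sleep(0.1)   # stand-in for playing synthesized audio
    print("speaking:", text)

async def turn_taking_agent():
    # Half-duplex: one thing at a time, like a chat assistant.
    for _ in range(3):
        frame = await listen()
        await speak(f"reply to {frame}")

async def simultaneous_interpreter():
    # Full-duplex: listening and speaking run concurrently.
    async def listen_loop(queue: asyncio.Queue):
        for _ in range(3):
            queue.put_nowait(await listen())
        queue.put_nowait(None)  # end-of-stream sentinel

    async def speak_loop(queue: asyncio.Queue):
        while (frame := await queue.get()) is not None:
            await speak(f"translation of {frame}")

    q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(listen_loop(q), speak_loop(q))

asyncio.run(turn_taking_agent())
asyncio.run(simultaneous_interpreter())
```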
Hermes Frangoudis (14:26) That makes sense. Specialization is really what sets you apart, right? Do something different and do it really well, and people will flock to it. Getting a little more into the details, and this is a bit of a touchy subject in some circles: privacy and data security. In terms of live translations, the voice is coming through, so how do you keep the data private? What do you do?

Artem Kukharenko (14:56) Yeah, privacy is a very important question nowadays, and it's getting more important every day. That's why we do the standard things, like best-in-class encryption and so on. But what's more important, we process everything in memory, so we don't store anything on the servers. Once it's translated, it's deleted. So we don't have any problems with storing data or the other privacy concerns that come with it.

Hermes Frangoudis (15:28) Nice. That's huge, right? And on the enterprise side, when you're talking to bigger clients and they have concerns, we'll say, around ownership and data sovereignty, how exactly does that work with you guys? How do you...

Artem Kukharenko (15:44) Yeah, some big clients want everything working inside their security perimeter, so that there's zero probability anything goes outside. For such cases we have different kinds of deployments, in private clouds or even on-premise, on their servers inside their security perimeter. In terms of the API we support, it looks the same; the only question is which servers our solution runs on.

Hermes Frangoudis (16:13) Gotcha, makes sense.

Artem Kukharenko (16:13) And in most cases our clients can try and test our solution in our cloud first. When they're happy with the latency and the quality of translation, they can deploy it on their own servers.

Hermes Frangoudis (16:29) That's awesome, so you give them a place where they can take that baby step, right? Get started, get working with the product, and then once they're ready, if they have that need, move into a self-hosted setup. Super cool.

Artem Kukharenko (16:40) Yeah, of course, because you don't want to build a data center and install computers just to test the technology. You want to test it with a couple of lines of code, and then, if you're happy, you can think about all the deployment details and so on.

Hermes Frangoudis (16:56) Makes total sense. In terms of product rollout and development, how do you use benchmarking to inform your product, what you're doing, and how successful you are?

Artem Kukharenko (17:09) Yeah, for different components we have different benchmarks. We have a benchmark for the speech-to-text component with standard metrics like WER, word error rate, and we have different benchmarks for text translation. But what's more important, we have one final benchmark with human interpreters who measure the overall performance of our system. That, we see, is the most important step, because you can measure all kinds of automatic benchmarks, but in most cases they don't show how people perceive your technology. And that's the most important thing.
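[Editor's note: for readers unfamiliar with the metric Artem names, here is a minimal word error rate (WER) implementation, the standard speech-to-text benchmark. Illustrative only; production evaluation toolkits add normalization, alignment reporting, and more.]

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("hello everyone and welcome", "hello everyone welcome"))  # 0.25
```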
Hermes Frangoudis (17:57) That's huge, having that human element within your benchmarking: not just how it weighs in statistically, but how it performs in the real world, in application. That's awesome to hear. In terms of, once you get this set up, what kind of customization options do you offer, and how do you tailor translation outputs for specific industries or use cases?

Ivan Kuzin (18:21) Well, say we work with Agora on a case. We would want your business to grow as fast as possible, so we would show total flexibility across our product stack. We'll try to adapt to your product's needs and to your customers' needs. I know that's a B2B2C product there, so it's kind of a challenging thing to do, but we aim to do it business-wise. Since we have control over the pipeline, we can use machine learning for the system to better understand the industry's needs and stay flexible. We can train it a little more on certain cases, on certain topics. We can do vocabulary uploads. We can adapt API methods and switch between technologies so the customer is happy and the business grows. We have our customers' business in mind here; that's the main thing. We want our customers to expand into new areas where they can speak to anyone, and that's our mission here.

Hermes Frangoudis (19:25) It's awesome to hear. Having that customer-focused approach is huge, right? Yeah, you have to focus wholeheartedly on the customer. Speaking of that, what tools or metrics do you provide to your customers so they can evaluate the translation quality in production?

Ivan Kuzin (19:29) That's the only way a company can survive now, actually. That's the case. Sure, sure.

Artem Kukharenko (19:47) Yes, the way we see it, we test and benchmark everything internally so that we provide the best quality we can. But that said, our customers can use our API, which is freely available for benchmarking, and they can run their own internal benchmarks as they like. We also have a couple of no-code solutions that you can download to benchmark, try, and test it yourself without writing any code.

Hermes Frangoudis (20:26) No-code is huge. Nowadays that's one of the main drivers: the ability to just get in there without much setup and put something together that works. In terms of the developer experience, because that's really near and dear to my heart in DevRel, what do you offer? Do you have documentation, quick-start guides? Is it on your GitHub? If a developer is looking to use Palabra, where can we send them?

Artem Kukharenko (20:52) Yeah, we have an API and documentation for it, and right now we're working on a huge update to the API that will be available very soon. We'll have client SDKs for the most popular programming languages, so it will be really easy to start using our API. Our low-level API is built on top of WebRTC, and there will be a couple of options: you can use WebRTC, use WebSockets, or use our client SDKs.
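[Editor's note: a sketch of what streaming translation over WebSockets typically looks like from the client side. The endpoint URL and message schema here are hypothetical, not Palabra's documented API; check their docs for the real contract.]

```python
import asyncio
import json
import websockets  # pip install websockets

# Hypothetical endpoint and schema, for illustration only.
URL = "wss://example.invalid/v1/translate?source=en&target=es"

async def send_audio(ws, chunks):
    for chunk in chunks:
        await ws.send(chunk)  # binary audio frames upstream
    await ws.send(json.dumps({"event": "end_of_stream"}))

async def receive_translation(ws):
    async for message in ws:  # translated audio/text frames downstream
        print(message if isinstance(message, bytes) else json.loads(message))

async def main(chunks):
    async with websockets.connect(URL) as ws:
        # Full-duplex: keep sending audio while receiving translations.
        await asyncio.gather(send_audio(ws, chunks), receive_translation(ws))

# asyncio.run(main(audio_chunks))  # supply short PCM frames, e.g. 20-40 ms
```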
Hermes Frangoudis (21:28) Nice, that's awesome. We'll make sure to link to the docs and the developer portal so people know where to go. Moving a little more into futuristic thinking, we'll call it: with the recent advancements in AI, how is Palabra approaching the integration of speech translation with other, broader AI capabilities?

Artem Kukharenko (21:58) So, as I said before, we're focused on a very narrow problem, the one problem of simultaneous interpretation, and we built our algorithms specifically for it. We build on top of different models for text translation, and I think that's where most of the intersection with other developments in AI happens, but we fine-tune and optimize them for our specific use case. For example, we want to be able to easily fine-tune the whole pipeline and algorithm to any client's needs, because we see that as a big differentiator from the big companies, and from the other companies that will eventually do speech translation.

Hermes Frangoudis (22:51) Totally makes sense. And it's really cool that you're always optimizing and bringing in different AI while staying focused on that specialization. Pretty cool. In terms of overall advancements in the field, what do you see as a couple of the key advancements that need to be made over the next... I think five years is probably too much, so over the next two years? Let's keep it short, because the windows move fast in the AI space.

Ivan Kuzin (23:22) Yeah, if we first speak about five years, I guess we'll be seeing really great advancements. All translation will be seamlessly integrated into all phone calls, all communication, at the OS level. When you start a stream like the one we have now, it will easily and seamlessly be translated into any language. That's how the barriers will be broken; we're moving humanity past this biblical issue, to where everyone will be able to speak to anyone. Speech prediction will be at an almost magical rate, I think in two years already, and in five years it'll be crazy. I think we'll see seamless translation of all the content you have on your phone in two years already; in five years it will be AI-created, then AI-translated.

Hermes Frangoudis (24:09) That's going to be the day, right? And in terms of this AI-to-AI world, what role do you see unsupervised learning really playing in the future development of speech models?

Artem Kukharenko (24:22) Yeah, I think unsupervised learning already plays a huge, maybe the biggest, role right now, because all the large models, like GPT-style models, are pre-trained with unsupervised learning. It's true for text models, and it's true for speech models. So I'd say it's the main technique for training large models right now. There are other techniques applied on top of unsupervised pre-training, but unsupervised pre-training gives a huge gain in the performance and accuracy of algorithms. And it will probably play an even bigger role in the future, because the amount of data is growing exponentially and you can't label it all. So you have to have techniques that can train algorithms without data labeling. But what's very important is data pipelines, because you can't just take any data and feed it to the AI.

Hermes Frangoudis (25:22) Like manual intervention. Yeah.

Artem Kukharenko (25:34) You need to preprocess it and clean it, and you need to clean it without human labelers, because their time is limited. You have to train algorithms and build pipelines so that you have clean data for unsupervised pre-training. It's a very big part of current AI production systems.
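[Editor's note: a tiny sketch of the kind of automated data cleaning Artem describes, filtering raw clips with cheap programmatic checks before unsupervised pre-training. The fields, thresholds, and checks are illustrative assumptions, not Palabra's actual pipeline.]

```python
from dataclasses import dataclass

@dataclass
class Clip:
    transcript: str
    duration_s: float
    snr_db: float  # signal-to-noise estimate from an upstream model

def is_clean(clip: Clip) -> bool:
    checks = [
        1.0 <= clip.duration_s <= 30.0,     # drop fragments and monologues
        clip.snr_db >= 15.0,                # drop noisy recordings
        len(clip.transcript.split()) >= 3,  # drop near-empty transcripts
    ]
    return all(checks)

raw = [
    Clip("hello everyone and welcome", 4.2, 22.0),
    Clip("uh", 0.4, 18.0),                     # too short
    Clip("market report for today", 6.1, 8.5), # too noisy
]
clean = [c for c in raw if is_clean(c)]
print(f"kept {len(clean)} of {len(raw)} clips")
```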
Hermes Frangoudis (25:57) Yeah, it's here and it's only getting bigger, right? It's only going to take over more, because there's just too much. It's probably not humanly possible to tackle the problem with people alone, without AI. So, getting back to Palabra: how does Palabra think about integrating speech-to-speech translation with downstream AI reasoning?

Artem Kukharenko (26:21) Yeah, we see that it will be natural to allow customers to choose the language in which they want to listen to content, and that's the main downstream application for Palabra. In the future it will just be a commodity, in a good sense: people won't even think about not being able to do it, because it will be very natural to choose a language, the same way you choose the language you view a website in today. It will be the same for audio.

Hermes Frangoudis (26:58) So it'll be like detecting the browser language and automatically bringing you there: way less touch, so to speak, from the end user, and way more of an expectation if you're not doing this.

Artem Kukharenko (27:10) Yeah, it will be strange if you don't provide it.

Hermes Frangoudis (27:10) You're missing out, right? So I've got one more. This one's a little different from all the others. If you weren't working on real-time speech translation, what voice AI domain would you bet on next?

Ivan Kuzin (27:26) I think it would be dubbing, actually, because it's an existing and ever-growing business, and the amount of content grows exponentially. You know that Netflix films a lot, and Amazon does as well. So dubbing is the next thing here, aside from speech-to-speech, and I would bet on that one. And you, Artem?

Artem Kukharenko (27:46) I'd go into dialogue systems, because I see huge applications for them. The accuracy of dialogue systems is very impressive right now, but there's still room for improvement, and that will open more and more applications. It's really mind-blowing how the technology is developing, that right now we can solve a lot of tasks we couldn't even think about five or ten years ago.

Hermes Frangoudis (28:21) The abilities of an individual, what they can do paired with an AI, it's just mind-blowing how far that's come. I remember starting in the industry with an AR company. We did a lot with computer vision, and how early that was, back ten years ago now. And how it's changed, to the point where you can feed an LLM an image and it has all these different algorithms in there, right? You can feed it audio, and the ability under the hood to handle that, interpret it, and translate it out is really the magic of what you do. Yeah, I remember that.

Ivan Kuzin (29:00) You remember the first Google Glass? It was pretty much ahead of its time, but if it had had AI on board, that would have been crazy. And they do have it now, actually.

Hermes Frangoudis (29:09) It would have been different.

Artem Kukharenko (29:11) You mentioned virtual reality, and one of our first clients was actually from the virtual reality area. They do conferences in virtual reality. So imagine you can attend a conference in virtual reality and choose the language in which you speak with other attendees and watch the main presentation, and so on.

Hermes Frangoudis (29:38) That must be amazing, especially given how niche that industry is, right? There are not a lot of people globally doing it, so when they have an opportunity to come together, it becomes even more important that they can communicate without that sort of language barrier.

Artem Kukharenko (29:50) Yeah, but the craziest thing is that the conference was about agriculture. So it was something about cows, but in virtual reality and with translation.

Hermes Frangoudis (30:01) Super cool, man.

Ivan Kuzin (30:02) Were there any cows wearing VR glasses?

Artem Kukharenko (30:04) I didn't see any.

Hermes Frangoudis (30:08) Man, super cool. I want to thank you both, Artem and Ivan, for joining me today. I really appreciate you taking the time to tell us about the amazing things Palabra is doing for the speech-to-speech translation industry. I want to thank all our viewers, everyone who watched the live stream and everyone watching the recording.
Please don't forget to like, subscribe, and share, and be on the lookout for more episodes of the Convo AI World Podcast. See you on the next one.

Artem Kukharenko (30:35) Thank you. It was a pleasure.

Ivan Kuzin (30:38) Thank you. Thank you all. Bye.