Derek (00:00) Wavespeed AI is an access layer of the AI generation.
Zeyi (00:05) ...own solutions to push their models on their own GPU farms. Your APIs are not demos.
Derek (00:13) We often see regional preferences. For face beauty, Asian people have a different preference than European people.
Zeyi (00:20) An API platform is the best solution right now.
Derek (00:30) Hi everyone, and welcome to the Convo AI World Podcast. This week I'm in Japan. It's an amazing opportunity to meet and talk with a lot of industry leaders who have truly unique perspectives on AI. It's always special to have face-to-face conversations with people who are shaping the future of this field, and I can't wait to bring these insights and ideas to all of you. All right, let's get started. Today we have Zeyi from Wavespeed AI joining our podcast. Zeyi has deep expertise in AI media generation, with a lot of experience in building the success... So let's dive right in and start today's episode. Zeyi, would you like to say hi to our audience?
Zeyi (01:16) Hey everybody, I'm Zeyi from Wavespeed.AI. I've been working in this generative media area for three years, and Wavespeed AI has now been on the market for over half a year. We have achieved great success in both marketing and product. It's a great honor for me to have the chance for this conversation.
Derek (01:48) Thank you. It's our honor too. Can you give a brief introduction to your background and what led you into this industry?
Zeyi (01:57) All right, so I have been working as a software engineer for over five years. Previously, I worked at tech giants like Alibaba as well as at some startups. I have a great interest in optimizing inference as well as the overall end-to-end product in generative media. I have been working with Stable Diffusion, with Flux, and now with other open-source video and image generation models.
Previously, I was very good at building software for inference optimization, but I realized that only by building an end-to-end product can we bring the best quality and service to the world. That's the reason I started the company: providing only a piece of software or an inference library, for example, is not enough. We believe we have the ability to work on all the necessary components in a single product. That's why we build.
Derek (03:13) Nice, nice. So Zeyi, you are the founder and CEO of Wavespeed AI. Can you tell us how Wavespeed came to be? What problem were you seeing at that moment, and what problem are you trying to solve?
Zeyi (03:27) Yeah, I think it was in 2024 that we witnessed the blossoming of the cutting-edge image-generation model called Flux. Last year, when Flux came out, only very few people had the ability to run it with reasonable speed and output quality. At the time, I built a solution. It was not open-sourced, but it was very good. After that, we realized there would be a great opportunity in video generation models, and we decided to take action to build the best inference solution for video models, and we prepared for that. So we started the Wavespeed company at the start of the year. We were actually very lucky, because just several days after we started the company, Alibaba open-sourced the Wan video model. It was very successful, and by that time we had prepared the best Wan video inference service for the world, so people started to recognize us.
Derek (05:07) That's nice, that's nice. So you just mentioned Flux, right, and Alibaba's model as well. Wavespeed is known as an AI media-generation platform, right? In that case, can you share your opinion on why users don't just go directly to Flux, Sora, NanoBanana, or the other tools?
Zeyi (05:32) Yeah, I think Wavespeed provides a unified interface, a developer or API interface, to all users. We designed the input schema as well as the output schema very carefully, so I think nearly all of the models now share a unified interface. That means you can switch to the newest model, or to the model that suits your needs, very easily. Actually, most of the time you only need to change the model name, I think. Another advantage is that we provide very high concurrency. Developers don't need to worry about concurrency problems anymore. On other platforms you get very limited concurrency, like 40, but we provide 500 or even 5,000. As long as you have the need, we can just increase it. Our backend is built with very high-performance languages, so I think we have the best technology to support your needs. Another key advantage is that we have many models and endpoints that you cannot get elsewhere. For lip sync with InfiniteTalk, our endpoint supports very long generations: 10 minutes. That's why many people choose our services, because you cannot get such long video advertising services elsewhere. We also have many other interesting things, like Wan 2.2 spicy or even video extending. Video extending is our latest creation; I think many people love it, and it is currently exclusive to Wavespeed AI.
Derek (07:27) Gotcha. You just mentioned InfiniteTalk and the length. I want to double-check: I think in the past the length of the generated video was a pain point. You could only get a short video of around 15 seconds, and longer videos were a problem. Do you still see that challenge now?
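Zeyi's point at 05:32, that a unified input schema lets a developer switch models by changing only the model name while the platform handles a concurrency limit, can be illustrated with a minimal sketch. Everything here is hypothetical: `GenerationRequest`, `build_request`, `fake_generate`, `run_batch`, and the model names are placeholders invented for illustration, not Wavespeed's actual API, and no network call is made.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class GenerationRequest:
    """Hypothetical unified input schema shared by every model."""
    model: str
    prompt: str
    params: dict = field(default_factory=dict)


def build_request(model: str, prompt: str, **params) -> GenerationRequest:
    # Switching models only changes the `model` string; the rest is identical.
    return GenerationRequest(model=model, prompt=prompt, params=params)


async def fake_generate(request: GenerationRequest) -> dict:
    """Stand-in for a real generation call; returns a unified output schema."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"model": request.model, "status": "completed"}


async def run_batch(requests: list, max_concurrency: int) -> list:
    """Run all requests, never more than `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(request: GenerationRequest) -> dict:
        async with semaphore:
            return await fake_generate(request)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(r) for r in requests))


if __name__ == "__main__":
    prompt = "a cat surfing at sunset"
    batch = [build_request(name, prompt) for name in ("model-a", "model-b")]
    for result in asyncio.run(run_batch(batch, max_concurrency=2)):
        print(result)
```

The semaphore is a client-side cap in this sketch; on a hosted platform the limit (40, 500, or 5,000 in the figures quoted above) would be enforced by the service itself.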
Zeyi (07:47) Yeah, partially, I think, because there are many different sorts of video generation models. Some are driven just by prompts, and some are driven by other inputs like audio or a reference video. With a reference pose video or a reference audio input, I think most of the time the model can generate very smooth videos. I mean, people just need to realize that, with YouTube at least. Nice, nice. It's audio-driven conversational AI.
Derek (08:24) Mmm.
Zeyi (08:28) It's different from Veo 3 or even Sora 2, but it can outperform them in the area that it is good at.
Derek (08:41) Nice, nice, nice. Before the podcast, I also had a chance to try Wavespeed AI personally. I like it very much. Personally, I would see Wavespeed AI as an access layer of the AI generation. Do you agree with my description of your platform? What do you think?
Zeyi (09:00) Yes, the current business model works like an access layer, and most people just use it this way. But under the hood, I think we have many different things. The most crucial difference from other platforms is that we also have many unique services and unique models created by us, such as video extending, InfiniteTalk, or some amazing image and video upscalers. We are unique, and even with the same open-source models, I think we often outperform our competitors in output quality, because we are very good at adjusting the parameters of those models. So we outperform not only in speed but also in other aspects like stability and quality, as well as duration or even functionality.
Derek (09:58) Oh, okay, that's nice. You mentioned that you built a unified interface for users. At the same time, you also mentioned that your backend is a very strong one that can handle a lot of concurrency. During the journey, what have been the most challenging problems you have overcome to build this platform?
Zeyi (10:21) Yeah, I think many people who are working on their own products are still working very hard on their own solutions to push their models on their own GPU farms. But it is actually much harder than people imagine, because you need to deal with many very rare cases like CUDA driver errors, hardware errors, and many other things. You need to ensure that under any circumstances your service keeps working and can scale to the traffic you have. It's very difficult. We realized this, and we want to build a service that helps people deal with it. I mean, we want to ensure that developers don't need to worry about those things and can keep their attention on what they are really good at, like building the frontend and the business logic.
Derek (11:30) Thank you for sharing. AI media generation started as a small market. A lot of people were very impressed by this technology, right? But now it is growing amazingly fast. It is growing rapidly. And we see a lot of new players like 4, like Runway. So how does Wavespeed AI differentiate itself from the other platforms?
Zeyi (11:55) I think other platforms often present themselves as the best at optimizing inference speed, or they just emphasize other aspects. We do not emphasize one specific area. We want to seek a balance between cost, speed, and quality, as well as functionality. And we want to be the best. I mean, best means that people can really integrate your API into their products. Your APIs are not demos. They must be carefully designed. When we build an API on Wavespeed AI, we investigate the market. How end users actually want to use your API to create their own work is very important. What sorts of creations do they actually want to generate?
Commercial market, advertisements, short videos: we investigate that, and we provide a very simple interface for people to use. That's very different, because I see other platforms often provide services that are overly complex or that can act only as a very small component in the final product. That's not enough, and we want to ensure that people can simply adopt our solution.
Derek (13:34) So it is developer-friendly.
Zeyi (13:36) Developer-friendly and product-friendly.
Derek (13:39) Nice, nice, glad to know. Thank you for that. What do you think the state of AI media generation is today? Do you think it is mature enough, or do you still see technical challenges or bottlenecks that are frustrating?
Zeyi (13:55) I think with the developments over the past year, the output quality of those models has actually become usable for AI-native companies and creators, but it is still very hard for people who are not so familiar with artificial intelligence or even software programming. I think right now still only a very small number of people or companies can utilize this.
Derek (14:26) Gotcha, I totally agree. Now, let's talk a little bit about your business strategy. I see you recently hosted a hackathon in Tokyo, right here. It seems that you value the Japanese market a lot. Can you share your plan and the strategic thinking behind that?
Zeyi (14:47) Yeah, we have some thoughts, actually. We recently observed that maybe an API isn't the best solution for entering every market. We want to ensure that individuals or companies can utilize our services even if they have no knowledge of how to integrate an API. So we will probably publish more solutions, like a simple AI studio, or what some people call an all-in-one website. An all-in-one website would be a great idea because it lets people try different models and compare their outputs very easily. We will give that a try.
That's what we will publish for some specific areas. Right now, I think we still only have an English version. In the future we will provide support for more languages, because we see that some clients from Japan, the Middle East, or European countries are not so comfortable visiting an English website, and often they do not search in English either. We will solve this problem, and we will try this to improve our ranking on Google.
Derek (16:24) Cool, that's awesome. It's very nice to have a CEO like you who thinks about the user experience, publishing tools instead of just APIs and providing different languages for local markets. I really appreciate that. I also want to get your opinion. At Agora, when we do business in audio and video, we often see regional preferences. For face beauty, Asian people have a different preference than European people. Do you get the same feedback in AI media generation?
Zeyi (16:58) Yeah, so on the cooperation with Agora: I know Agora has some excellent voice models, and we will try cooperating with Agora to see if we can work with these models on our platform. I think voice models may have quite different requirements when we publish them on our own platform, because people will need streaming output and some people also want very low latency. We will see whether our infrastructure can meet this sort of requirement. If not, we will try improving it. I think it is a very good chance to test the ability of our platform. Another question?
Derek (17:49) Yeah, so I mean, do you see preferences and differences between the regions? Yeah.
Zeyi (17:57) Between Japan and other regions. Actually, I think the users are quite different, as well as the models. We can see that the models published by different companies always have different preferences. So for example, ByteDance's models,
Derek (18:00) I'm doing.
Zeyi (18:17) Seedream is very good at generating Asian faces. Some people think that it is not actually very good at generating Western, European, or African faces. The Flux models also have a very distinctive style, which means they are not very realistic. Some people like a realistic style, especially in America, in the United States, as well as in Europe. It's very important to make sure that the image generation model can generate a very realistic image, I mean, like one just shot on a phone. It's a typical style. People just love that.
Derek (19:03) Nice, nice. I also want to learn from you about AI video generation. For the output you just mentioned, how do I know, how do I measure, whether a video is good or not?
Zeyi (19:15) I think you have different measurements. For example, one is the resolution. Resolution is a very crucial factor. If you reduce the resolution, you can actually make the inference very fast. But I think most people now want to make sure that the output is not blurry. So they might need to generate at a higher resolution, or, if they choose to generate at a low resolution, they can use a video upscaler afterwards to upscale the video. That's why we now have three different video upscalers on our platform. They differ a lot in speed, quality, and cost. We want to make sure that people can choose their own.
Derek (20:03) Who would want to generate a low-resolution video and then upscale it?
Zeyi (20:10) Yeah, I think when you generate low-resolution videos, you just verify the quality manually. If the overall content of the video is good enough for the user, you choose to upscale it with a video upscaler. It's more cost-efficient than just generating at a higher resolution, because if you generate at a higher resolution and then see that the emotion or the face resemblance is not good enough, you actually waste more money.
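The cost argument at 20:10 can be made concrete with a small worked example. This is a sketch under stated assumptions, not Wavespeed's pricing: the prices and the acceptance rate are made-up numbers, each attempt is assumed to be accepted independently with the same probability at either resolution (so the expected number of attempts is `1 / accept_rate`), and the upscaling step is assumed to run once and always succeed.

```python
def _check_rate(accept_rate: float) -> None:
    # The acceptance rate is a probability and must be usable as a divisor.
    if not 0.0 < accept_rate <= 1.0:
        raise ValueError("accept_rate must be in (0, 1]")


def expected_cost_direct(cost_high: float, accept_rate: float) -> float:
    """Expected spend when every attempt is generated at high resolution."""
    _check_rate(accept_rate)
    return cost_high / accept_rate


def expected_cost_draft_then_upscale(
    cost_low: float, cost_upscale: float, accept_rate: float
) -> float:
    """Expected spend when drafts are low resolution and only the
    accepted draft is upscaled once."""
    _check_rate(accept_rate)
    return cost_low / accept_rate + cost_upscale


if __name__ == "__main__":
    # Hypothetical prices per clip, in arbitrary units.
    direct = expected_cost_direct(cost_high=1.0, accept_rate=0.5)
    staged = expected_cost_draft_then_upscale(
        cost_low=0.25, cost_upscale=0.25, accept_rate=0.5
    )
    print(f"direct high-res:    {direct:.2f}")
    print(f"draft then upscale: {staged:.2f}")
```

With these illustrative numbers, rejected drafts cost a quarter of what rejected high-resolution clips would, which is the "you actually waste more money" effect described above. The gap shrinks as the acceptance rate approaches 1 or as the upscaling price rises.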
Derek (20:44) So it's for cost efficiency.
Zeyi (20:46) For cost efficiency as well as for speed. Most video generation models are diffusion-based and use full attention. Full attention can be extremely slow if the number of tokens is too large. A video upscaler can use sparse attention and other optimized methods to improve its speed, so it can be more cost-efficient than generating end-to-end at high resolution.
Derek (21:21) Exactly, that's my biggest takeaway today. I learned from it. Now I know. So yeah, you mentioned that you have built tools like InfiniteTalk and the upscaler. You are building a lot of tools, if I look at your platform. What's the motivation behind building so many tools for our developers?
Zeyi (21:43) Yeah, because I think every model now has quite different abilities. Even if they are designed to do the same task, like generating images with a particular figure, they can deliver quite different quality. As I mentioned before, some are good at generating Western faces. Some are good at generating Eastern faces. When it comes to image editing models, they are quite different too. Some are just good at preserving faces. And some deliver very good clothing texture. They are quite different, so I think when developers want to build a practical solution in their application, like virtual try-on, face swap, or selfie generation, they must compare very carefully to see which model meets their needs best. And sometimes they need to switch to different models. For example, if their end users are in Japan, China, or Korea, they need to use Seedance. But when they have European consumers, they need to switch to another model. It's very crucial for them.
Derek (23:13) We also want to learn from you. Recently, we have seen a lot of amazing features from other video models: Sora 2 released Cameos, right?
And Veo 3 also tried to launch their process object-editing features. So what would be the plan for Wavespeed AI in the next release?
Zeyi (23:34) Yeah, I think currently we still don't have a very good web interface for people to try models, see their output quality, and compare different models. We will build another unified web interface for that. It's a unified AI Studio where you can switch between different models and different tasks, like image generation and video generation, without switching to another web page as you do on our website today. Users will only need to click different buttons and switch between tabs in a unified web interface. That's what we are currently working on. We will also try combining vision-language models with generative models to see if we can help our end users solve their problems in their specific areas, like making advertisements or creating short videos, and whether we can do this optimization for them automatically. I know this sounds very easy as an idea, but the implementation can be very hard, because you need to know the limitations and requirements of different models very well. We now have quite different models. For example, for video models that can generate synchronized audio, we now have three or four models, LTX-2, and Sora. Actually, they are quite different. Some of them can understand very complex instructions and input schemas, while some can only understand a simple prompt. You need to build a very specific optimization strategy for those models.
Derek (25:54) Yeah, yeah, gotcha, gotcha. As I mentioned once already, I'm really happy to see Zeyi, as the founder and CEO, always thinking about developers, right?
Always trying his best to bring the best experience to our developers. I really appreciate that. Recently I got an idea. I think people are talking about whether there will be a chance in the near future for us to interact with the characters in a video. Do you think that would be possible in the near future, like we are a part of the movie?
Zeyi (26:27) Yeah, I think that will require very strong inference optimization techniques, because you need to make these interactions real-time. It's actually still very challenging, and maybe the best solution is not to run these models on a cloud server. Maybe we need to deliver some solutions for people to try this on their own desktop computers. But I think overall the most important factor is still the output quality. If you ignore that and just use some very small, very tiny models, I think we can now make streaming output very easy. But we see that many people still want to keep the output quality as high as possible. They want, for example, to ensure that the quality is at least on par with the Wan 2.2 model. So it's still very challenging. I think we still have a very long way to go.
Derek (27:33) Nice, gotcha, gotcha. It's a pleasure to talk with you today. Before we wrap up, would you like to say a few words to our audience, our developers, and creators who have just started AI businesses or just learned about AI?
Zeyi (27:49) Yeah, I think it's very important that you work natively with AI. When you want to build your own application or website, I think making heavy use of AI techniques in your own organization is very important, because that's what we are currently doing in our own company.
I think we can now offload over 90% of the coding work to tools like Claude Code or OpenAI Codex. People just need to realize this, and they need to make sure that their code structure and project structure are friendly enough to these tools. You need to think very deeply about how to utilize those tools, because if you build the wrong project structure, it will be very hard to get optimal results from them. This is important for people who want to start very quickly: choose the right solution, the right way of building your software. That is one aspect. Another aspect is that if you want to try those image and video generation models, and you know you are not very good at optimizing them, an API platform is the best solution right now.
Derek (29:27) Thank you for the advice. It is very useful. It was a pleasure to have you on the podcast today, Zeyi. I personally wish Wavespeed and your team continued success in the business. Thank you to everyone for tuning in and watching the episode today, and we'll see you in the next episode. Bye-bye, everyone.