Joshua Xu is co-founder and CEO of HeyGen, an AI video creation and translation platform that recently raised $60 million at a $500 million valuation from investors including Benchmark. HeyGen lets users create, localize, and personalize video content using AI-generated avatars, eliminating the need for cameras and traditional video production workflows. Joshua discusses HeyGen’s viral moments, the technical challenges of building engaging avatars, how enterprises are using AI video tools, the company’s approach to trust and safety around voice cloning, competitive dynamics with platforms like TikTok and Snap, and his vision for where video creation is headed by 2030.
HeyGen’s Viral Moments and the “Magic” of AI Video
HeyGen has had several viral moments, most notably when public figures like Elon Musk used the platform to dub a speech by the president of Argentina into multiple languages at the World Economic Forum.
Joshua describes these moments as unexpected — the team focuses on shipping product improvements and listening to customers, trusting that a breakthrough moment will eventually resonate with the market.
The first “magic moment” for Joshua personally was creating his own avatar and watching himself speak on screen — a moment that convinced him of the product’s potential.
He now uses his avatar internally for things like product update emails, finding it far easier than recording himself on camera.
The Future of AI in Video Production
Joshua frames HeyGen’s mission as replacing the camera in video production. Traditional video requires filming with a camera and then editing footage — generative AI changes both steps.
Instead of filming, AI can generate footage directly. Instead of timeline-based editing, future tools may use script-based or document-style interfaces to assemble videos.
HeyGen started from the belief that video is just binary data (zeros and ones) and that machines can learn to generate it — even before the term “generative AI” was widely used.
The company’s north star is AI quality: can generated footage truly replace camera footage? This depends on lighting, lip sync, body motion, gesture, and whether the presenter’s expressions match the script.
Joshua pushes his team with a simple test: would you use this avatar in your own day-to-day work?
HeyGen’s Use Cases and Customer Base
HeyGen has over 40,000 customers who primarily use the platform for three things:
Create: Generate videos using avatars (personal or stock) by typing text — no camera needed.
Localize: Translate existing videos into over 175 languages and dialects while preserving the original speaker’s voice tone, facial expressions, and lip sync.
Personalize: Take one video and generate over 100,000 variations tailored to specific customers, industries, or pain points — similar to how email personalization works.
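The email-personalization analogy can be sketched in a few lines. This is only an illustration of template-driven script generation, not HeyGen's API: the template text, the placeholder names, and the recipient records are all hypothetical. Each rendered script would then be handed to whatever avatar-rendering step produces the actual video.

```python
from string import Template

# Hypothetical script template; placeholder names are illustrative only.
SCRIPT = Template(
    "Hi $first_name, I noticed $company works in $industry. "
    "Here is how we can help with $pain_point."
)


def personalize(template, recipients):
    """Yield one rendered script per recipient record."""
    for record in recipients:
        # substitute() raises KeyError if a placeholder is missing,
        # which surfaces incomplete records early.
        yield template.substitute(record)


# Made-up recipient data standing in for a CRM export.
recipients = [
    {"first_name": "Ana", "company": "Acme",
     "industry": "logistics", "pain_point": "driver onboarding"},
    {"first_name": "Ben", "company": "Globex",
     "industry": "retail", "pain_point": "seasonal training"},
]

scripts = list(personalize(SCRIPT, recipients))
for script in scripts:
    print(script)
```

The same loop scales from two records to many thousands; the expensive part in practice would be rendering each video, not producing the scripts.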
HeyGen is not built for professional video editors. It’s designed for the 99% of users who aren’t video professionals — marketers, content creators, sales teams, and others who have ideas but lack camera equipment or editing skills.
Making AI Video Accessible to New Users
Onboarding new users is a challenge because HeyGen represents an entirely new way of creating content.
The company invests heavily in showcasing what’s possible through use case demonstrations across marketing, sales, customer success, training, and creator verticals.
The goal is to get new users to their specific use case as quickly as possible and show them a “magic moment” that demonstrates the product’s value.
What Makes a Good Avatar
The central question for avatar quality is: is the footage engaging?
In business contexts, a video’s job is to deliver a message effectively — if viewers drop off after a few seconds, the message isn’t landing.
Engagement comes from coordinated expression: mouth movement, head movement, eyebrow motion, body gesture — all working together naturally.
Joshua’s mental model: humans don’t just speak with their mouths; they speak with their whole body. Replicating that coordination is the hardest technical challenge.
HeyGen builds all avatar models in-house, including lip sync, body motion, and full-body rendering.
Their Avatar 3.0 model renders the entire body, not just the face. Gesture is the next frontier.
Training involves large-scale video data of people speaking, with the model learning to reproduce talking styles — not just mouth movements but head tilts, gestures, and expressions.
Personal video avatars require only 30 seconds to 2 minutes of footage to learn someone’s talking style.
The team is working on capturing different “modes” — presentation mode, interview mode — and eventually letting the avatar adapt its behavior based on the script content.
Interactive and Synchronous Avatars
HeyGen has a beta interactive avatar product that can attend Zoom meetings and interact in real time.
The main technical challenge is inference speed: as models grow larger and more complex, generating responses fast enough for real-time interaction is difficult.
Joshua is optimistic that within 12 months, real-time avatar generation will be feasible, even on-device.
The use case he’s most excited about: personalized video ads. Instead of every viewer seeing the same ad from a brand, the ad could adapt based on the viewer’s preferences and watch history.
HeyGen’s Approach to Video Generation vs. Text-to-Video Models
HeyGen focuses on business videos, where control, consistency, and brand quality matter most.
There are two technical approaches to AI video generation:
End-to-end pixel generation (like Sora, Pika): generates video frame by frame from text prompts.
Orchestration engine (HeyGen’s approach): assembles video from modular components — script, voice, music, avatar, footage — giving businesses precise control over each element.
HeyGen believes the orchestration approach is better for enterprise needs because it ensures brand consistency and quality control.
The company sees text-to-video models as important components that HeyGen can integrate as building blocks within its orchestration layer, rather than competing with them directly.
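One way to picture the orchestration idea is as a specification of named components that can be swapped independently. The sketch below is an assumption-laden illustration, not HeyGen's architecture: the VideoSpec class, its field names, and the render_plan steps are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VideoSpec:
    """Hypothetical description of a video as modular components."""
    script: str
    voice: str
    avatar: str
    music: str = "none"
    footage: tuple = ()  # clip names; could come from a text-to-video model


def render_plan(spec):
    """Return the ordered steps an orchestrator might run for this spec."""
    steps = [
        f"synthesize voice '{spec.voice}' from script",
        f"animate avatar '{spec.avatar}' to the voice track",
    ]
    steps += [f"insert footage '{clip}'" for clip in spec.footage]
    if spec.music != "none":
        steps.append(f"mix music '{spec.music}'")
    steps.append("compose final video")
    return steps


base = VideoSpec(
    script="Welcome to the Q3 update.",
    voice="brand_voice",
    avatar="ceo_avatar",
)
# Change one component; everything else stays fixed.
variant = replace(base, music="upbeat_theme")

for step in render_plan(variant):
    print(step)
```

Because each element is a separate field, a brand can lock the avatar and voice while varying the script or music, which is the kind of control the orchestration approach is meant to provide.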
Brand Personalization
An area Joshua sees as a major future opportunity: brand personalization for video.
Today, tools like ChatGPT can write in a brand’s tone if given context. Video can’t do this yet.
The vision: feed HeyGen a URL or past videos from a company, and the AI learns the brand’s color palette, style, opening sequences, and visual language — then bakes those into newly generated videos.
This would work similarly to how LLMs use context windows: brand inputs become memory that gets embedded into the model.
Competing with Industry Giants
Joshua previously worked at Snap, and HeyGen competes with platforms like TikTok and Snapchat that have massive distribution.
His view: incumbents and startups are not competing in the same market. Incumbents built tools for people who already have cameras and editing skills. HeyGen is creating a new market for people who don’t.
Platforms like TikTok face a growing dilemma: they’re built around human creators, but as the share of AI-generated content grows (by Joshua’s rough estimate, perhaps 10% of content today), ranking and recommending it alongside creator content becomes a problem. If AI content reaches 50%, human creators get less visibility.
Joshua believes this tension will eventually require a new platform purpose-built for AI-generated content — though building a consumption platform is not HeyGen’s mission. HeyGen is focused on being the creative tool layer.
Enterprise Push: Lessons and Surprises
HeyGen has made a significant enterprise push in recent months.
Enterprise customers have much higher quality requirements — brand consistency and output quality are paramount.
Integration into day-to-day workflows is critical. HeyGen has partnered with platforms like HubSpot, embedding its tools into existing marketing and CRM ecosystems so users can pull data and ship videos without leaving their workflow.
Joshua was surprised by how much attention enterprise customers pay to avatar quality details — especially gesture accuracy over longer videos. One European customer gave detailed feedback about how the avatar’s gestures drifted from the script after the first few minutes of a six-minute video.
Trust and Safety in AI
Trust and safety is a core business priority, especially given HeyGen’s enterprise customer base.
Avatar creation safeguards:
Every avatar requires a video consent recording.
AI matches the consent video to the submitted footage to verify identity.
A dynamically generated passphrase expires every 10–15 seconds, making it nearly impossible to create someone’s avatar without their consent.
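A short-lived passphrase of this kind can be sketched with a keyed hash over a time window, similar in spirit to time-based one-time passwords. This is an assumption about how such a scheme could work, not a description of HeyGen's implementation; the word list, the 15-second window, and the function names are all hypothetical.

```python
import hashlib
import hmac

WINDOW_SECONDS = 15  # assumption: upper end of the 10-15 second range
# Tiny illustrative word list; a real system would use a much larger one.
WORDS = ["amber", "river", "copper", "meadow",
         "signal", "harbor", "violet", "summit"]


def passphrase(secret: bytes, session_id: str, now: float, n_words: int = 4) -> str:
    """Derive the phrase for the time window that contains `now`."""
    window = int(now // WINDOW_SECONDS)
    message = f"{session_id}:{window}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    # Map the first few digest bytes onto words.
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:n_words])


def is_valid(spoken: str, secret: bytes, session_id: str, now: float) -> bool:
    """Check a transcribed phrase against the current window's phrase."""
    expected = passphrase(secret, session_id, now)
    return hmac.compare_digest(spoken, expected)


secret = b"server-side-secret"  # placeholder value
shown = passphrase(secret, "session-123", now=100.0)
print(shown)
print(is_valid(shown, secret, "session-123", now=104.0))  # same 15 s window
```

Because the phrase depends on the current window and a server-side secret, a recording made earlier would generally not contain the phrase that is valid at submission time.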
Content moderation:
HeyGen prohibits hate speech, misinformation, and political campaign content.
Moderation uses a hybrid approach: AI model review plus human moderation team review.
IP partnerships:
HeyGen has partnerships with actors who have authorized their likenesses on the platform as stock avatars.
Joshua sees potential in AI-generated IP: entirely new AI-generated characters that stay consistent across generations. These could become the influencers and IP of the future (like the AI influencer Miquela on Instagram).
Fundraising and Financial Strategies for AI Startups
Joshua has thought carefully about how much capital HeyGen needs.
The two biggest cost factors are GPU compute and talent.
Unlike traditional software (where marginal cost per customer is near zero), AI has significant marginal costs because each generation consumes GPU resources.
However, AI-native companies are far more efficient: employees using tools like ChatGPT get much more done. Joshua himself uses his own avatar for customer testimonials and his AI for product updates.
He points to ChatGPT reaching 100 million users as evidence that AI companies can grow faster with less capital than previous generations of software companies.
HeyGen offers a free tier and designs products 12 months ahead of current cost curves — building features today that will be economically viable when inference costs drop.
He doesn’t wait for costs to come down before building; he builds for where costs will be.
Quickfire
Overhyped: The speed at which AI will deliver massive value in enterprise. People expect transformation to happen faster than it realistically will.
Underhyped: The ultimate impact of AI. The long-term transformation will be enormous even if adoption takes time.
Changed his mind on: Early in the company’s history (around 2021), HeyGen invested in 3D modeling as the path to video generation. After seeing Stable Diffusion, Joshua shifted to pixel-based generation, recognizing it would advance much faster due to its ability to train on large-scale data.
Customer surprise: How much attention customers pay to avatar gesture details, especially in longer videos where gestures can drift from the script.
Most exciting AI product outside video: Google’s NotebookLM, which can turn blog posts or URLs into podcast-style audio conversations between two AI voices. Joshua has already used it to convert his weekly product updates into a podcast format for his team.
Vision for 2030
Joshua’s mental model for HeyGen’s evolution: imagine having a personal video agency sitting next to you at all times.
You describe your ideas, the agency films the footage, edits it, presents it, incorporates your feedback, and delivers the final video.
By 2030, everyone will have this “video agency in their pocket” — an interactive AI experience that feels like working with a personal video production team.
Just as the mobile camera in 2012 led to unforeseen platforms like Instagram, Snapchat, and TikTok, AI video tools will open up entirely new categories of content and use cases that are hard to imagine today.
Joshua’s personal motivation: he gets the most joy from building tools that other people use to create things on their own.