Learn how to use HeyGen video and audio dubbing, precision lip sync, and multilingual avatar videos to translate content and scale your message worldwide.
If you’re already creating great videos in HeyGen, translation is where things really start to scale.
This Bootcamp session with Onee Yekeh (Product Manager) and Andrei Gusev (Research Engineer) was all about one question: How do you take the videos you already have and make them work in every market you care about, without rebuilding everything from scratch?
Here’s the breakdown of what they covered and how to think about translation inside HeyGen.
Two ways to translate: Audio dubbing vs video dubbing
When you go to Create → translate a video in HeyGen, you’ll see two options:
- Video dubbing
- Audio dubbing
They both start the same way:
- Upload a video file or paste a YouTube / Google Drive link
- Choose your target language(s)
- Click translate
But what you get back is different.
What video dubbing does
Video dubbing is the full experience:
- Translates your speech into the target language
- Clones your voice
- Generates the new audio
- Resyncs your lips and facial movements to match the new language
This is what most people think of as the “magic” HeyGen translation feature: your face, your voice, speaking another language with natural lip sync.
Video dubbing is ideal when:
- Your face is clearly visible and speaking to the camera
- Lip sync realism matters (courses, CEO messages, marketing videos)
- You want viewers to feel like you really recorded in that language
What audio dubbing does
Audio dubbing:
- Translates and dubs the audio
- Keeps your original video visuals exactly as they are
- Does not adjust the lips
This is useful when:
- The speaker is small on screen or not the main focus
- You’re working with footage that doesn’t need lip sync (screen recordings, B-roll-heavy edits)
- You care more about quick multilingual audio than hyper-realistic lips
You can access both from:
- Create → translate a video, or
- Apps → translate video in the HeyGen interface
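HeyGen does have a developer API, but the sketch below is not its documented contract. If it helps to picture this flow as code, here's a minimal hypothetical version: the endpoint URL, auth header, and field names (`source`, `target_language`, `dub_type`) are all placeholder assumptions of mine.

```python
# Hypothetical sketch only: the URL, auth header, and JSON fields below
# are placeholder assumptions, NOT HeyGen's documented API contract.
import requests

API_KEY = "your-api-key"  # assumption: simple key-based auth

def submit_dub_job(video_url: str, target_language: str, dub_type: str) -> dict:
    """Submit a translation job. dub_type mirrors the UI choice:
    'video' = full video dubbing (voice clone + lip sync),
    'audio' = audio dubbing (visuals left untouched)."""
    if dub_type not in ("video", "audio"):
        raise ValueError("dub_type must be 'video' or 'audio'")
    response = requests.post(
        "https://api.example.com/v1/translations",  # placeholder endpoint
        headers={"X-Api-Key": API_KEY},
        json={
            "source": video_url,  # uploaded file URL or YouTube/Drive link
            "target_language": target_language,
            "dub_type": dub_type,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Full video dub (lips resynced) into Spanish:
job = submit_dub_job("https://example.com/demo.mp4", "es", "video")
```

The point is the shape of the request: one source, one target language, and one dub-type switch that mirrors the two options in the UI.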
Speed vs precision engines: When to choose which
Inside video dubbing, HeyGen gives you two translation engines:
- Speed
- Precision
They both translate your video, but they optimize for different things.
Speed engine: For everyday use and fast turnaround
The speed engine is optimized for:
- Cost
- Latency (how long it takes)
- Standard-quality lip sync
Roughly speaking, a 1-minute input video takes about 4–12 minutes to process.
It’s great for:
- Talking-head videos with moderate movement
- Day-to-day content
- Social clips, regular updates, internal comms
If your video doesn’t have wild motion, multiple speakers, or tricky angles, speed will usually do the job.
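If you're budgeting render time for longer inputs, you can extrapolate from that 1-minute figure. The linear scaling in this little helper is my assumption, not a guarantee:

```python
def estimate_speed_engine_wait(input_minutes: float) -> tuple[float, float]:
    """Rough processing-time range for the speed engine, scaled linearly
    from the ~4-12 minutes quoted for a 1-minute input. Linear scaling
    is an assumption; actual times vary with footage and load."""
    return (input_minutes * 4, input_minutes * 12)

low, high = estimate_speed_engine_wait(5)  # a 5-minute video
print(f"Expect roughly {low:.0f}-{high:.0f} minutes of processing.")
# -> Expect roughly 20-60 minutes of processing.
```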
Precision engine: For quality, complexity, and tough footage
The precision engine is built for maximum quality. It uses HeyGen’s latest, more advanced models to handle:
- Wide angles and side angles
- Lots of head movement
- Objects partially blocking the face (hands, mics, props)
- Multiple speakers with tighter timing requirements
- Lower-quality original audio
It’s particularly good for:
- Multi-speaker videos (panels, interviews, group content)
- Educational content (lectures, explainer videos)
- Footage where the camera or subject isn’t perfectly framed straight-on
If your video is visually or structurally complex, precision is the engine you want to try.
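If you're routing videos through a pipeline and need to make this call programmatically, here's the session's guidance condensed into a tiny helper. The heuristic is my paraphrase of the talk, not an official rule:

```python
def pick_engine(
    face_clearly_framed: bool,
    heavy_motion: bool,
    face_occluded: bool,
    multiple_speakers: bool,
    noisy_audio: bool,
) -> str:
    """Any 'difficult footage' signal -> precision; otherwise speed.
    This mirrors the session's rule of thumb, not an official heuristic."""
    difficult = (
        not face_clearly_framed
        or heavy_motion
        or face_occluded
        or multiple_speakers
        or noisy_audio
    )
    return "precision" if difficult else "speed"

# A straight-on talking head with clean audio:
print(pick_engine(True, False, False, False, False))  # -> speed
# A panel discussion shot from a side angle:
print(pick_engine(False, False, False, True, False))  # -> precision
```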
Main languages vs local variants: How target languages actually work
When you pick your target language, you’ll often see:
- A “general” version of the language (e.g., English general)
- One or more locale-specific variants (e.g., English US, English UK)
- Additional accents or regional forms in a “second tier”
Andrei explained why: there’s a trade-off between voice similarity and accent consistency.
When to use the general language option
General options (like English general) are optimized to:
- Preserve your speaker identity as much as possible
- Keep your voice clone sounding like you in the new language
Choose general when:
- You care about the translated voice sounding recognizably like the original speaker
- The exact accent (US vs UK vs neutral global) matters less than identity
When to use locale-specific variants
Locale options (like English US, English UK) focus on:
- Accent consistency
- Naturalness for that specific region
Choose locale variants when:
- You’re targeting a specific market and want the accent to match audience expectations
- You’re less concerned about the voice being a perfect match to your original and more focused on local feel
In other words:
- General → “Make it sound like me”
- Locale-specific → “Make it sound like a native of this market”
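That trade-off is easy to encode if you're batch-translating and want the choice made consistently. A sketch, with variant naming that's purely illustrative:

```python
def pick_language_variant(
    language: str,
    preserve_speaker_identity: bool,
    target_locale: str | None = None,
) -> str:
    """General variants favour voice similarity; locale variants favour
    accent consistency. Variant naming here is illustrative only."""
    if preserve_speaker_identity or target_locale is None:
        return f"{language} (general)"
    return f"{language} ({target_locale})"

# "Make it sound like me":
print(pick_language_variant("English", preserve_speaker_identity=True))
# "Make it sound like a native of this market":
print(pick_language_variant("English", preserve_speaker_identity=False, target_locale="UK"))
```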
Advanced translation settings: Collections, captions, and more
Out of the box, you can:
- Upload your video
- Choose a target language
- Click translate
But there’s also an advanced options panel that’s worth knowing about.
Key settings include:
- Dynamic duration
  - Improves quality by flexing the video’s duration slightly to better fit the translated speech timing
- Lip sync toggle
  - Turn lip sync on or off if you want only audio changes
- Remove background sound
  - Optionally strip or reduce the original background audio
- Resolution
  - Maintain resolution up to 4K for the output
- Collections and multilingual player
  - If you select multiple target languages and keep collection turned on:
    - HeyGen creates a group of translations
    - You get a multilingual player where viewers can switch between languages (e.g., French, Arabic, English original)
- Captions and enhanced voice
  - You can enable captions, enhance the voice track, and more, depending on your use case
You can always add more translations later by hitting modify on an existing translation, selecting new languages, and regenerating within the same collection.
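If you translate a lot of videos, it can help to pin these choices down in one place so every batch uses the same settings. The field names in this sketch are my own shorthand for the UI toggles above, not an official schema:

```python
from dataclasses import dataclass

@dataclass
class TranslationSettings:
    """Models the advanced options panel described above. Field names are
    my own shorthand for the UI toggles, not an official schema."""
    target_languages: list[str]
    dynamic_duration: bool = True        # flex duration to fit translated speech
    lip_sync: bool = True                # off = audio-only changes
    remove_background_sound: bool = False
    resolution: str = "4K"               # output kept at up to 4K
    collection: bool = True              # group languages in one multilingual player
    captions: bool = False
    enhance_voice: bool = False

# One French + Arabic batch with captions on:
settings = TranslationSettings(target_languages=["fr", "ar"], captions=True)
```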
Proofread: Edit translations before you generate the final video
Sometimes “translate and go” isn’t enough. Maybe:
- You’re working with regulated content
- The script is heavy with domain-specific terms
- You have in-house language experts who need to review
That’s where proofread comes in.
How proofread works
When you start a translation:
- Upload your video
- Choose your target language(s)
- Instead of clicking translate, click review and edit
HeyGen will:
- Process your video and generate:
  - Input transcription (original language)
  - Output transcription (translated language)
In the proofread interface (available for team and enterprise users), you can:
- See both original and translated text side by side
- Edit the translated text directly
- Use search and replace for bulk fixes (e.g., standardizing a term)
- Adjust timing in the timeline, nudging speech speed or segment splits
- Preview the result in context
- Swap the voice used for the translation, if needed
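Search and replace is especially useful for enforcing terminology. Here's the idea in miniature, done in plain Python on an invented transcript; the proofread UI does the equivalent for you in-app:

```python
# Illustrative only: enforcing an approved-terminology glossary across a
# translated transcript -- the same kind of bulk fix the proofread UI's
# search-and-replace supports.

GLOSSARY = {
    "tableau de bord": "dashboard",  # e.g., keep the product term in English
}

def apply_glossary(translated_text: str, glossary: dict[str, str]) -> str:
    for term, approved in glossary.items():
        translated_text = translated_text.replace(term, approved)
    return translated_text

draft = "Ouvrez le tableau de bord pour lancer la traduction."
print(apply_glossary(draft, GLOSSARY))
# -> "Ouvrez le dashboard pour lancer la traduction."
```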
Once you’re happy, click generate result and HeyGen creates the final dubbed video with your reviewed translation baked in.
Inviting native speakers to review
For teams working with external reviewers or local experts, there’s an invite proofreader option:
- Give someone view or edit access to a specific project.
- They can:
  - Log in
  - Review and adjust transcriptions
  - Preview audio and timing
  - Approve and generate the final video
This is especially useful when you don’t speak the target language yourself but want a native speaker to sign off.
Handling multiple speakers and overlapping speech
What about videos with more than one person?
According to Andrei:
- Multiple speakers are supported
- The system can:
  - Identify different speakers
  - Translate each separately
  - Keep timing aligned
- The hardest case is overlapping speech (people talking over each other)
  - This is challenging even for humans
  - If speech is overlapping heavily, expect some limitations
Rule of thumb: if you can clearly tell who is saying what, HeyGen’s translation engine will generally handle it well.
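To see why overlap is the hard case, it helps to picture speech as timed speaker segments: once two segments share a time window, there's no clean slot for a single translated line. A toy illustration (not HeyGen internals):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float

def overlaps(a: Segment, b: Segment) -> bool:
    """True when two segments share any part of the timeline."""
    return a.start < b.end and b.start < a.end

host = Segment("host", 0.0, 4.2)
guest = Segment("guest", 4.0, 7.5)  # slight crosstalk with the host
print(overlaps(host, guest))  # True -> expect some limitations here
```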
Multilingual avatar videos vs video translation: What’s the difference?
This is a big source of confusion, so Onee broke it down clearly.
There are two different translation concepts in HeyGen:
- Video translation (dubbing)
- Multilingual avatar videos in AI Studio
They both involve language, but they work in different layers.
Video translation: transform an existing video
Video translation is what we covered above:
- You start with finished footage (camera-shot or previously generated)
- You translate the audio track (and optionally lips)
- You get back a dubbed version of that same video in another language
This is post-production: You’re changing a rendered video.
Multilingual avatar video: translate the script and regenerate
Multilingual avatar video lives inside AI Studio and works differently:
- You have an existing AI Studio draft (avatar + script + elements)
- You click translate in the studio
- You choose target languages
- HeyGen:
  - Translates your script
  - Translates text on the canvas (titles, labels, etc.)
  - Creates new drafts for each language
From there:
- You can open the new draft in, say, French
- Review or tweak the translated script and on-screen text
- Click generate to create a video natively authored in that language
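Conceptually, this is a fan-out: one source draft becomes a draft per language, with both the script and the on-canvas text translated before anything is generated. A rough sketch, where `translate()` is a stand-in rather than a real SDK call:

```python
def translate(text: str, lang: str) -> str:
    """Placeholder for the real translation step."""
    return f"[{lang}] {text}"

source_draft = {
    "script": "Welcome to our product tour.",
    "canvas_text": ["Product Tour", "Step 1: Sign up"],
}

# One draft per target language, script and on-canvas text both translated:
drafts = {
    lang: {
        "script": translate(source_draft["script"], lang),
        "canvas_text": [translate(t, lang) for t in source_draft["canvas_text"]],
    }
    for lang in ["fr", "de", "ja"]
}
# Each drafts[lang] is then reviewed and generated separately.
```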
Key difference:
- Video translation → Takes a finished video and dubs it
- Multilingual avatar video → Translates the project (script + visuals) and generates a fresh video in that language
They’re complementary, not competing. Use:
- Video translation when:
  - You already have a final video file
  - You don’t want to reopen or redesign the project
- Multilingual avatar video when:
  - You’re still in AI Studio
  - You want on-screen text, captions, and layout to change with the language
  - You’re comfortable generating separate drafts per language
Real-world improvements: What the precision engine can handle now
Onee closed the session by showing what the newest precision model can handle in practice:
- A science demo with a cloth in front of the face → Precision kept lip sync natural even with facial occlusion
- A clip with lots of head movement and angle changes → The lips stayed aligned as the speaker moved
- A wide-angle shot and side-profile video → The model applied lip sync convincingly even when the face wasn’t straight-on
- A lower-quality original video → Precision still produced realistic lip sync in the translated version
The takeaway: If your footage is even slightly “difficult” (movement, side angles, objects, multiple speakers), precision is worth trying first.
Best practices for multilingual workflows in HeyGen
To wrap it up, here’s a simple way to think about your translation options:
- Start with what you have.
  - If it’s a finished video file → use video dubbing
  - If it’s an AI Studio project → consider multilingual avatar drafts
- Pick the right engine.
  - Speed for simple talking heads, everyday content
  - Precision for complex visuals, multi-speaker, or “hero” pieces
- Choose the right language variant.
  - General for preserving your voice identity
  - Locale variants for specific markets and accents
- Use proofread when quality really matters.
  - Side-by-side text
  - Native reviewers
  - Voice swaps and timing adjustments
- Keep languages separated by scene when mixing.
  - HeyGen can technically handle multiple languages in one scene
  - But for right-to-left and left-to-right combos (like Arabic + English), it’s usually easier to keep them in separate scenes so animation markers stay sane
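If it helps, that whole checklist collapses into a single decision pass. This is purely illustrative; the real choices live in the HeyGen UI:

```python
def plan_translation(
    have_finished_file: bool,
    difficult_footage: bool,
    need_local_accent: bool,
    needs_review: bool,
) -> dict:
    """One pass over the checklist above. Illustrative only."""
    return {
        "workflow": "video dubbing" if have_finished_file else "multilingual avatar drafts",
        "engine": "precision" if difficult_footage else "speed",
        "variant": "locale-specific" if need_local_accent else "general",
        "proofread": needs_review,
    }

# A finished hero video for the UK market, with legal review:
print(plan_translation(True, True, True, True))
```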
If you’re already comfortable creating videos with avatars in HeyGen, translation is the lever that lets you:
- Reach new markets
- Repurpose your best content without reshooting
- Give every region a version that feels made for them
The simplest way to start:
- Take a one-minute video you already like
- Go to translate a video
- Choose video dubbing, pick precision, and select one target language
- Let it process, then review the result in the multilingual player
Once you see yourself, or your avatar, speaking another language naturally, it becomes obvious how much global leverage you can get from every single video you make.