Avatar V: Scaling Video-Reference

Introducing Avatar V, a production-scale system that generates photorealistic talking avatar videos from a short video reference, preserving a person's appearance, expressions, and talking style at 1080p and 30 FPS, with no limit on video duration.


By HeyGen research team · Apr 7, 2026 · 8 min read

Generating a talking avatar that actually looks and moves like a specific person is harder than it sounds. Most systems start from a single photo: one angle, one expression, one lighting condition. The model has to guess everything else, which leads to faces that drift, lips that don't quite sync, and motion that feels generic rather than personal.

Avatar V takes a fundamentally different approach. Instead of extracting identity from a single image, it conditions on a full video reference and learns to reproduce fine-grained details through attention, the same way large language models learn from in-context examples. The result: 1080p avatar videos at 30 FPS, of any length, that faithfully preserve who a person is and how they move.

Avatar V has been deployed across 5,000+ GPUs and serves millions of generation requests.

Why single-image conditioning falls short

Nearly all existing avatar systems condition on a single static reference image. This provides shallow identity information and creates three recurring problems:

• Identity drift. The model hallucinates unseen angles and expressions, losing fine-grained facial details as the video progresses.

• Generic motion. When identity is a static embedding and motion is a separate signal, the system can't capture an individual's talking rhythm, habitual micro-expressions, or gestural tendencies. The avatar looks like the person but doesn't move like them.

• Underserved critical regions. Standard diffusion training distributes learning signal uniformly across the frame. But the regions that matter most for avatar quality, like lip shape, teeth, and eye gaze, occupy a tiny fraction of total pixels. They end up undertrained.

How Avatar V works: in-context personality learning

The core idea is to treat personality embedding as an in-context learning problem. Rather than compressing a person's identity into a fixed-size vector, the model conditions directly on the full token sequence of their reference video. At every transformer layer, it attends to the reference to extract appearance, expression, and motion details.

Sparse Reference Attention

Naively attending to all reference tokens alongside generation tokens would be prohibitively expensive: attention cost scales quadratically with total sequence length. Avatar V introduces an asymmetric mechanism where generation tokens attend to all reference tokens, but reference tokens only attend to themselves. This prevents noise from the generation process from contaminating the identity signal, and reduces the conditioning cost from quadratic to linear in reference length. The model can therefore condition on long video references without a computational bottleneck.
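The asymmetry can be sketched as an attention mask. The following is a minimal illustration (not Avatar V's actual implementation): rows are queries, columns are keys, reference tokens come first, and only the generation rows attend across the boundary.

```python
import numpy as np

def sparse_reference_mask(n_ref: int, n_gen: int) -> np.ndarray:
    """Boolean attention mask (True = query may attend to key).

    - Reference queries attend only to reference keys, so noise from the
      generation process never contaminates the identity signal.
    - Generation queries attend to all reference and generation keys.
    """
    n = n_ref + n_gen
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_ref, :n_ref] = True   # reference -> reference only
    mask[n_ref:, :] = True        # generation -> everything
    return mask

mask = sparse_reference_mask(n_ref=4, n_gen=2)
# Cost is n_gen * (n_ref + n_gen) for the generation rows plus n_ref^2 for
# the reference block; the reference block can be computed once and cached,
# so the per-step cost grows linearly with reference length.
```

For a fixed generation window, growing the reference only adds columns to the generation rows, which is where the linear scaling comes from.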

Learning how someone moves, not just how they look

A convincing avatar needs to reproduce an individual's characteristic motion, not just their face. Avatar V introduces a dual-role motion representation that serves simultaneously as a generation target and a conditioning signal. This creates a closed-loop training signal: the model learns to understand each person's talking patterns and then reproduce them, including rhythm, micro-expressions, and gestures.
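A toy sketch of the dual-role idea, using stand-in components (a frame-delta "motion encoder" and a trivial generator, neither of which is Avatar V's actual model): motion features extracted from the reference clip act as the conditioning signal, while the same kind of features computed from the ground-truth clip act as the supervised target, closing the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_motion(frames: np.ndarray) -> np.ndarray:
    # Hypothetical motion encoder: frame-to-frame deltas as motion features.
    return np.diff(frames, axis=0)

# Toy "reference" and "ground-truth" clips of the same person.
reference = rng.normal(size=(8, 16))   # 8 frames, 16-dim features
target = rng.normal(size=(8, 16))

motion_cond = extract_motion(reference)    # role 1: conditioning signal
motion_target = extract_motion(target)     # role 2: generation target

def toy_generator(cond: np.ndarray) -> np.ndarray:
    # Stand-in for the model: predicts motion given the conditioning code.
    return cond * 0.9

pred = toy_generator(motion_cond)
loss = float(np.mean((pred - motion_target) ** 2))
```

The point is the training structure, not the toy math: because one representation is both input and target, gradients push the model to understand a person's motion patterns and to reproduce them.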

Recovering detail at high resolution

Fine facial details, including dental structure, skin texture, and precise lip shapes, are inevitably lost at base resolution. Avatar V includes an identity-aware super-resolution refiner that inherits the full video reference conditioning from the base model. It uses identity and audio signals to recover these details in a single denoising step with sparse temporal attention, keeping latency practical for production.

Training pipeline

Avatar V follows a progressive five-stage training pipeline. Each stage builds on the previous and introduces more specialized supervision:

1. Text-to-video pretraining establishes general video generation capabilities on broad data.

2. Audio-to-video pretraining introduces audio conditioning for speech-visual alignment.

3. Personality supervised fine-tuning uses curated cross-identity data with auxiliary losses for identity preservation, motion fidelity, lip-sync accuracy, and perceptual quality.

4. Two-phase distillation (classifier-free guidance followed by distribution matching) delivers over 10x inference acceleration.

5. Reinforcement learning from human feedback aligns output with human preferences across identity, motion, and visual quality dimensions.

Data curation at scale

Training data quality is a bottleneck for identity-preserving generation. Avatar V is backed by a data engine that orchestrates 25+ processing stages and 20+ specialized AI models to produce training data at three quality tiers, from broad pretraining corpora to post-training subsets scored across ten fine-grained dimensions.

One key innovation is cross-clip identity connectivity: the engine builds a graph linking same-identity clips across visually distinct scenes, so the model learns to separate who someone is from what they're wearing, where they are, or how they're lit. A 100+ annotator platform provides human verification throughout.
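A minimal sketch of cross-clip identity connectivity using a union-find structure (all clip names and the matching criterion here are hypothetical; the real engine uses its own face-matching models): clips are nodes, a link is added whenever two clips are judged to show the same person, and connected components group one identity across different scenes.

```python
from collections import defaultdict

class IdentityGraph:
    """Union-find over clips; components = one identity across scenes."""

    def __init__(self):
        self.parent = {}

    def find(self, clip: str) -> str:
        self.parent.setdefault(clip, clip)
        while self.parent[clip] != clip:          # path halving
            self.parent[clip] = self.parent[self.parent[clip]]
            clip = self.parent[clip]
        return clip

    def link(self, a: str, b: str) -> None:
        # Called when a face-matching model scores (a, b) as the same person.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

    def components(self):
        groups = defaultdict(list)
        for clip in self.parent:
            groups[self.find(clip)].append(clip)
        return list(groups.values())

g = IdentityGraph()
g.link("office_clip1", "beach_clip7")   # same person, different scenes
g.link("beach_clip7", "studio_clip3")
g.link("podcast_a", "podcast_b")        # a different person
```

Training on whole components rather than single clips is what lets the model factor identity apart from clothing, location, and lighting.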

Production infrastructure

Avatar V runs on HELIOS, HeyGen's unified GPU infrastructure platform. HELIOS manages 5,000+ GPUs across 5+ cloud providers, 10+ regions, and 15+ standardized cells, supporting reserved, on-demand, and preemptible capacity under one system.

The inference stack includes VideoRef context caching, sequence parallelism, a fused operator library, FP8 quantization, and streaming VAE decode. Together, these enable chunk-based generation of infinite-duration video at 1080p, 30 FPS. The data processing engine alone handles 200K+ concurrent tasks.
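The chunk-based pattern can be sketched as follows. This is a schematic with placeholder functions, not HeyGen's inference code: the reference is encoded once (standing in for the VideoRef context cache), each audio window drives one generation pass, and a short carry-over of frames seeds the next chunk for continuity.

```python
from typing import Iterator, List

def encode_reference(ref_frames: List[str]) -> str:
    # Hypothetical stand-in for the VideoRef context cache:
    # encode the video reference once, reuse it for every chunk.
    return f"ref_cache({len(ref_frames)} frames)"

def generate_chunk(ref_cache: str, audio_chunk: str, carry: List[str],
                   chunk_len: int = 4) -> List[str]:
    # Stand-in for one generation pass conditioned on the cached reference,
    # this window's audio, and a short carry-over for temporal continuity.
    start = len(carry)
    return [f"frame[{audio_chunk}:{start + i}]" for i in range(chunk_len)]

def stream_video(ref_frames: List[str], audio_chunks: List[str],
                 overlap: int = 1) -> Iterator[str]:
    ref_cache = encode_reference(ref_frames)   # paid once, reused per chunk
    carry: List[str] = []
    for audio in audio_chunks:
        chunk = generate_chunk(ref_cache, audio, carry)
        yield from chunk
        carry = chunk[-overlap:]               # seed the next chunk

frames = list(stream_video(["r0", "r1"], ["a0", "a1", "a2"]))
```

Because per-chunk cost is constant once the reference cache exists, the loop can in principle run indefinitely, which is what makes unbounded-duration output practical.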

After outgrowing Ray's centralized coordination at scale, the team replaced it with a purpose-built declarative engine inspired by Kubernetes, achieving 95%+ GPU utilization, sub-30-second failure detection, and zero-downtime deployments.

Results

On a cross-scene benchmark, Avatar V achieves state-of-the-art performance in identity preservation, lip synchronization, and generation quality, consistently outperforming existing approaches. The system serves millions of generation requests in production.

Looking ahead

The architectural foundations of Avatar V, particularly the unified token sequence and asymmetric video reference attention, provide a flexible substrate for future work. We see natural extensions toward real-time streaming, multi-person scenarios, and full-body generation.