Most talking avatar systems start from a single photograph. One image, one angle, one expression. From that frozen frame the model has to guess everything else: how you look when you turn your head, how light falls on your cheekbones, what your face does mid-sentence. It works surprisingly well, but there is an obvious ceiling.
TAVR changes the starting point. Instead of a single photo, it takes a short video clip and uses that much richer signal to build a more faithful model of your identity. The result is an avatar that looks more like you, moves more like you, and generalizes to new backgrounds and scenes without losing who you are.
Why Video References Matter
A single image captures one moment. A video captures dozens of angles, expressions, and lighting conditions in a few seconds. TAVR exploits this by accepting multi-frame video references of flexible length. As the number of reference frames increases from 12 to 48, identity similarity improves continuously—without degrading lip sync or general video quality.
The key advantage is cross-scene generation. Most previous methods struggle when the target background differs from the reference: they either lose identity fidelity or produce visual artifacts from a two-stage edit pipeline. TAVR handles this natively by conditioning on video references while accepting a separate background image, producing avatars in customized scenes in a single pass.

Figure 1: Video-reference generation yields significantly better identity preservation compared to single-image baselines. Attention heatmaps show how the model selectively aggregates salient identity cues—lip shapes, facial silhouettes—from highly correlated reference frames.

Figure 2: Performance as a function of reference frame count. (a) Identity similarity improves continuously from 12 to 48 frames. (b) Lip sync scores remain stable regardless of reference length. (c) General video quality is unaffected, indicating the identity gain does not compromise generation stability.
The improvement is also visible qualitatively in the generated videos. With fewer reference frames, the model lacks enough visual evidence for fine details and resorts to hallucination:

Figure 3: Visual comparison across reference lengths. At 12 frames, the model hallucinates teeth artifacts (orange arrows) due to missing inner-mouth priors. At 48 frames, the richer context provides explicit visual cues, eliminating these artifacts.
Architecture
TAVR is built on the Wan2.1-T2V-14B video diffusion backbone, adapted for talking avatar generation. The framework integrates cross-scene video references into the generation pipeline through four key components:
- Flexible video referencing: The model accepts variable-length video references, dynamically leveraging temporal context. More frames means more identity information—diverse poses, expressions, and lighting conditions all contribute to stronger identity preservation.
- Token selection module: Video references introduce far more tokens than a single image. The token selection module filters them in latent space, using facial bounding boxes to retain only the most identity-relevant tokens while discarding background and redundant information. This keeps compute manageable.
- Reference Self-Attention: The standard self-attention layer is reformulated so that the target generation and reference tokens can jointly attend to their combined context. This injects rich identity cues into the generation stream without separate cross-attention modules.
- Audio Cross-Attention: A frame-wise cross-attention mechanism aligns driving audio with the generated frames for lip sync, while simultaneously injecting the reference audio to establish temporal audio-visual correspondence within the reference stream.
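The token selection and joint-attention steps above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the latent-grid layout, the helper names, and the single-head attention with shared Q/K/V are all assumptions made for clarity (the real model uses multi-head blocks inside a DiT backbone).

```python
import numpy as np

def select_identity_tokens(latents, bbox, grid_hw):
    """Keep only latent tokens whose spatial position falls inside the
    face bounding box, discarding background tokens.
    latents: (T, H*W, C) reference latent tokens
    bbox:    (x0, y0, x1, y1) in latent-grid coordinates (assumed layout)
    grid_hw: (H, W) spatial size of the latent grid
    """
    H, W = grid_hw
    ys, xs = np.divmod(np.arange(H * W), W)  # grid position of each token
    x0, y0, x1, y1 = bbox
    keep = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    return latents[:, keep, :]  # (T, n_kept, C)

def reference_self_attention(target, reference):
    """Single-head joint self-attention: target and reference tokens
    attend over their concatenation, so identity cues flow into the
    generation stream without a separate cross-attention module.
    target: (N_t, C), reference: (N_r, C) -> updated target (N_t, C)
    """
    x = np.concatenate([target, reference], axis=0)  # joint context
    scores = x @ x.T / np.sqrt(x.shape[-1])          # projections omitted
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ x
    return out[: target.shape[0]]                    # keep only target stream
```

The point of the sketch is the shape bookkeeping: filtering happens before attention, so the quadratic cost of the joint context grows with the number of kept face tokens, not the full reference video.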

Figure 4: Overview of the TAVR framework. Cross-scene video references are encoded by the VAE, filtered by a Token Selection module, and processed through adapted Transformer blocks with Reference Self-Attention and Audio Cross-Attention.
Long Video Generation
The base model generates clips of about 3 seconds. For longer videos, TAVR uses a motion frames strategy: it extracts the final latent frames from the previous clip as motion priors for the next, ensuring smooth dynamics across generation windows. A global appearance anchor, the very first latent frame, replaces the masked background tokens of the current window and prevents identity drift over time.
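The chunked loop can be sketched as follows. Here `generate_clip` is a hypothetical callable standing in for one diffusion pass, and `n_motion` is an assumed prior length; only the data flow (motion priors forward, a fixed first-frame anchor) mirrors the description above.

```python
def generate_long_video(generate_clip, audio_chunks, n_motion=5):
    """Chunked long-video generation (sketch).
    generate_clip(audio, motion, anchor) -> latent clip, indexable as (F, C).
    Motion frames from the previous clip seed the next window; the very
    first latent frame serves as a global appearance anchor throughout.
    """
    clips, motion, anchor = [], None, None
    for audio in audio_chunks:
        clip = generate_clip(audio, motion, anchor)
        if anchor is None:
            anchor = clip[0]        # first latent frame anchors identity
        motion = clip[-n_motion:]   # final latents become motion priors
        clips.append(clip)
    return clips
```

Because the anchor is fixed once and reused for every window, identity cannot drift the way it would if each clip anchored on its immediate predecessor.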
Three-Stage Training
Cross-scene video references introduce a domain gap. A reference video shot in a studio looks very different from a target scene on a city street. TAVR bridges this gap through progressive training:
- Stage 1: Same-scene pretraining. The model learns foundational appearance copying from large-scale intra-scene video data. Reference and target come from the same clip, so the model focuses purely on learning to reproduce identity and motion.
- Stage 2: Cross-scene fine-tuning. Reference and target are now sampled from different videos of the same person. The model is forced to learn genuine identity aggregation rather than superficial pixel copying, bridging the domain gap between different environments.
- Stage 3: Reinforcement learning. A task-specific DPO stage uses ArcFace identity similarity as a reward signal. A spatially masked objective focuses the reward exclusively on the foreground avatar region—because measuring identity fidelity against background pixels would dilute the signal and introduce spatial noise.
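The Stage 3 reward can be sketched as a masked cosine similarity. The embedder `gen_embed_fn` stands in for an ArcFace-style face encoder, and applying the mask by zeroing pixels is an assumed simplification; the point is that background pixels contribute nothing to the reward signal.

```python
import numpy as np

def masked_identity_reward(gen_embed_fn, gen_frame, ref_embedding, fg_mask):
    """Spatially masked identity reward (sketch).
    gen_frame: (H, W, 3) generated frame, fg_mask: (H, W) avatar mask.
    The mask zeroes background pixels before the (hypothetical) face
    embedder, so the reward reflects only the foreground avatar region.
    """
    fg = gen_frame * fg_mask[..., None]  # strip background before embedding
    e = gen_embed_fn(fg)
    # cosine similarity to the reference identity embedding
    denom = np.linalg.norm(e) * np.linalg.norm(ref_embedding) + 1e-8
    return float(e @ ref_embedding / denom)
```

In a DPO setup this scalar ranks candidate generations: the sample with the higher masked reward becomes the preferred one in each training pair.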
Cross-Scene Benchmark
Existing talking avatar benchmarks use single-image references from the same scene—they cannot evaluate the robustness of cross-scene video-reference generation. We constructed a new benchmark of 158 high-quality cross-scene video pairs curated from TalkVid, where each pair shows the same person in visually distinct environments. Strict facial consistency is enforced via ArcFace thresholding, and background discrepancy is maximized by retaining only the pair with the lowest background PSNR per identity.
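The curation rule described above can be sketched as a small selection loop. `face_embed` and `bg_psnr` are hypothetical helpers (an ArcFace-style encoder and a background-region PSNR), and the threshold value is an assumption; the logic is the stated one: enforce facial consistency, then keep the single lowest-PSNR pair per identity.

```python
import numpy as np
from itertools import combinations

def curate_pairs(videos, face_embed, bg_psnr, id_thresh=0.6):
    """Benchmark curation sketch. videos: identity -> list of clips.
    Keep candidate pairs whose face embeddings exceed a similarity
    threshold (same person), then retain the one pair per identity with
    the lowest background PSNR (most dissimilar backgrounds)."""
    selected = {}
    for pid, clips in videos.items():
        best = None
        for a, b in combinations(clips, 2):
            ea, eb = face_embed(a), face_embed(b)
            sim = ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb))
            if sim < id_thresh:
                continue                  # reject inconsistent faces
            psnr = bg_psnr(a, b)
            if best is None or psnr < best[0]:
                best = (psnr, (a, b))     # maximize background discrepancy
        if best is not None:
            selected[pid] = best[1]
    return selected
```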
Results
TAVR consistently outperforms existing methods across all metrics on the cross-scene benchmark. We evaluate against five state-of-the-art methods under two reference paradigms: using the raw cross-scene image directly, and using an edited image adapted to the target scene.

Table 1: Quantitative comparison with state-of-the-art methods on the cross-scene benchmark. TAVR yields the best identity similarity and achieves the highest overall generation quality (16.42) while maintaining competitive lip synchronization, significantly outperforming all baselines.
TAVR with 20 reference frames achieves an overall quality score of 16.42—significantly ahead of the next best method HuMo (14.13). With 48 frames, identity similarity reaches 0.83 (reference) and 0.69 (target)—the highest of any method. Lip synchronization remains stable across all reference lengths, and visual quality metrics show no degradation from the increased identity conditioning.

Figure 5: Qualitative comparison. TAVR produces more faithful identity preservation and higher visual quality compared to all baselines, particularly in cross-scene settings where the background differs from the reference.
Why This Matters
TAVR demonstrates that the shift from image to video references is not incremental—it is a qualitative leap in avatar fidelity. By conditioning on temporal identity signals rather than a single frozen frame, the model captures the way a person actually looks across expressions and poses, producing avatars that are more recognizable and more natural. This directly enables HeyGen's video-reference avatar product, where users can create high-fidelity digital avatars from short self-recorded clips rather than studio photography sessions.
All videos shown in this report are for research demonstration purposes only. HeyGen's platform enforces consent verification for all digital twin creation.