A model is only as good as its training data. For Avatar V, that means millions of talking-head videos spanning every combination of ethnicity, age, lighting, camera angle, and speaking style. But raw video from the internet is messy. Most clips are unusable: wrong framing, poor audio, multiple speakers, text overlays, synthetic faces, watermarks.
We built a distributed data engine—a pipeline of 25+ processing stages powered by 20+ specialized AI models—that starts from 50M raw videos and produces clean, labeled, identity-linked training examples. The pipeline branches into two paths: a large-scale pretraining corpus (100M+ clips) for learning general motion and appearance priors, and a curated audio-to-video fine-tuning corpus (10M+ clips) with dense avatar-specific annotations. Getting the data right was at least as hard as getting the model right, and arguably more important.

Figure 1: Data curation pipeline overview. Starting from 50M raw videos, the pipeline applies shared segment-level curation before branching into pretraining (100M+ clips) and A2V fine-tuning (10M+ clips) paths, with human annotation and cross-clip identity connectivity feeding into the avatar-specific branch.
Segment-Level Curation
Raw videos pass through a 10-stage cascade that progressively filters and annotates content, ordering cheap checks ahead of expensive model inference:
- Normalization. Standardizes resolution (longest side 640px) and frame rate (25 fps).
- Temporal pre-filtering. Rejects choppy or static content via frame-difference analysis and perceptual hashing. Runs on CPU only, eliminating degenerate content before the GPU stages.
- Human detection. A joint object detector and face analysis model verifies human presence and defines eligible temporal intervals.
- Optical flow. Quantifies motion statistics that feed into the clipping optimizer.
- Visual quality assessment. Q-Align scores keyframes with continuous quality scores calibrated to human opinion.
- Smart clipping. Formulates clip selection as a constrained optimization problem, jointly maximizing clip duration while satisfying constraints on visual quality, motion, and face presence ratio—replacing brittle independent threshold cascades.
- Scene-cut detection. Identifies scene boundaries within clips.
- Content filtering. VLMs reject screencasts, game footage, and static photo content.
- Categorization. Clips are classified across 15 semantic dimensions for distribution balancing.
- Video embeddings. Extracted for downstream deduplication.
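The smart-clipping stage above can be sketched as a constrained search over frame windows. A minimal brute-force version, with hypothetical threshold values, assuming per-frame quality scores, motion magnitudes, and face-presence flags are already computed:

```python
def select_clip(quality, motion, has_face,
                min_quality=3.5, min_motion=0.2,
                min_face_ratio=0.9, min_len=25):
    """Pick the longest contiguous frame window whose aggregate
    statistics satisfy all constraints (O(n^2) for clarity; the
    production optimizer would be far more sophisticated)."""
    n = len(quality)
    best = None
    for start in range(n):
        q_sum = m_sum = f_sum = 0.0
        for end in range(start, n):
            q_sum += quality[end]
            m_sum += motion[end]
            f_sum += has_face[end]
            length = end - start + 1
            if length < min_len:
                continue
            if (q_sum / length >= min_quality
                    and m_sum / length >= min_motion
                    and f_sum / length >= min_face_ratio):
                if best is None or length > best[1] - best[0] + 1:
                    best = (start, end)
    return best  # (start_frame, end_frame) or None
```

Because the objective is joint, a window can tolerate a few low-quality frames as long as the aggregate stays above threshold, which is exactly what an independent per-frame threshold cascade cannot do.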
Pretraining Data
After segment-level curation, the pretraining branch applies GPU-accelerated deduplication: nearest-neighbor indexing over video embeddings groups near-duplicate clips into clusters, retaining only the highest-quality clip per cluster. Rule-based derived categories enable distribution rebalancing across content types.
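At scale this runs on a GPU approximate-nearest-neighbor index, but the keep-best-per-cluster logic can be sketched with exact cosine similarity and a greedy best-first pass (clip fields and the similarity threshold here are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedup(clips, sim_threshold=0.95):
    """clips: list of (clip_id, quality_score, embedding).
    Visit clips best-first and keep a clip only if it is not a
    near-duplicate of anything already kept, so each duplicate
    cluster is represented by its highest-quality member."""
    kept = []
    for clip_id, quality, emb in sorted(clips, key=lambda c: -c[1]):
        if all(cosine(emb, k[2]) < sim_threshold for k in kept):
            kept.append((clip_id, quality, emb))
    return [c[0] for c in kept]
```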
Deduplicated clips then pass through 13 parallel extraction stages:
- Visual analysis: OCR text detection, lip-sync scoring, whole-body pose estimation with dense keypoints, anatomical quality scoring, and synthetic audio detection.
- Audio analysis: Language identification, speaker diarization, and ASR with word-level timestamps.
- Captioning and embeddings: An in-house audio-video understanding captioner producing rich descriptions, plus text embedding pre-encoding for diffusion conditioning.
- Latent pre-encoding: Multiple video VAE architectures producing latent representations for direct diffusion transformer training.
This produces the 100M+ clip pretraining corpus.
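Since the extraction stages are independent per clip, they can fan out concurrently and merge into a single annotation record. A toy sketch of that orchestration pattern, with hypothetical stage functions standing in for real model inference:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions; the real stages wrap model inference.
def ocr(clip):      return {"ocr_boxes": []}
def lip_sync(clip): return {"lip_sync_score": 0.97}
def diarize(clip):  return {"num_speakers": 1}

STAGES = [ocr, lip_sync, diarize]

def annotate(clip_id):
    """Run independent extraction stages concurrently and merge
    their outputs into one annotation record for the clip."""
    record = {"clip_id": clip_id}
    with ThreadPoolExecutor(max_workers=len(STAGES)) as pool:
        for result in pool.map(lambda stage: stage(clip_id), STAGES):
            record.update(result)
    return record
```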
Audio-to-Video Fine-Tuning Data
The A2V branch applies additional avatar-specific curation to produce training data tailored for talking-head and portrait animation. Ten fine-grained quality signals are computed per clip:

Table 1: The 10 fine-grained quality signals computed per clip for A2V fine-tuning data.
These signals are composable—quality tiers can be constructed by combining different thresholds without re-running inference. This produces the 10M+ clip A2V fine-tuning corpus.
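Composability here just means the signals are stored per clip, so a tier is a set of thresholds applied at query time. A sketch with invented signal names and threshold values:

```python
# Hypothetical signal names and thresholds; real tiers would combine
# the 10 stored per-clip signals from Table 1.
TIERS = {
    "gold":   {"lip_sync": 0.9, "face_sharpness": 0.8, "head_stability": 0.7},
    "silver": {"lip_sync": 0.8, "face_sharpness": 0.6, "head_stability": 0.5},
}

def tier_filter(clips, tier):
    """Select clips meeting every threshold for the tier. No model
    re-runs: the signals were computed once and persisted."""
    thresholds = TIERS[tier]
    return [c for c in clips
            if all(c["signals"][k] >= v for k, v in thresholds.items())]
```

Tightening or loosening a tier is then a metadata query rather than a multi-day inference job.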
Human Data Annotation System
Achieving the highest quality thresholds, particularly for RLHF and quality model training, requires reliable human judgment at scale. A distributed annotation platform supports 100+ concurrent freelance annotators across multiple geographic regions.
Annotation tasks span five categories:
- Quality scoring. Curated clips rated across perceptual dimensions (visual quality, facial naturalness, lip-sync accuracy, motion smoothness) using calibrated Likert scales—serving as ground truth for automated quality models.
- Preference labeling. Pairwise comparisons of generated outputs along axes of identity preservation, expression naturalness, and audio-visual synchronization—producing training data for DPO and GRPO reward models.
- Bad case filtration. Identifying subtle artifacts that escape automated filters: temporal identity drift, teeth deformation, occlusion glitches, asymmetric blinking. Flagged samples feed back into corpus cleaning and new detector development.
- Competitive benchmarking. Blinded side-by-side comparisons of model iterations and competitor systems, producing win-rate matrices that guide improvement priorities.
- Attribute annotation. Gaze direction, emotion, gesture type, and speaking style for conditional generation supervision.
Organization
The workforce follows a three-tier hierarchy. Tier 1 annotators (100+ freelancers) complete qualification pipelines with calibration exercises before admission to production tasks. Tier 2 reviewers perform random audits and adjudicate disagreements to produce gold-standard labels. Tier 3 task designers define schemas, write guidelines, and manage the feedback loop between annotation results and model improvement.
Incentive System
Base compensation is supplemented by quality bonuses tied to agreement rates with reviewer audits. A leaderboard tracks accuracy and consistency, with top performers receiving priority access to higher-paying tasks. Annotators falling below thresholds are assigned re-calibration exercises. This closed-loop system maintains inter-annotator agreement above 85% across all task types.
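The agreement-rate mechanics reduce to comparing an annotator's labels against reviewer gold labels on audited items. A sketch, where the bonus curve and its parameters are invented for illustration:

```python
def agreement_rate(annotator_labels, reviewer_labels):
    """Fraction of audited items where the annotator matched the
    reviewer's gold label."""
    matches = sum(a == r for a, r in zip(annotator_labels, reviewer_labels))
    return matches / len(reviewer_labels)

def quality_bonus(rate, base_pay, threshold=0.85, max_bonus=0.2):
    """Hypothetical bonus curve: zero below the agreement threshold,
    linear up to max_bonus of base pay at perfect agreement."""
    if rate < threshold:
        return 0.0
    return base_pay * max_bonus * ((rate - threshold) / (1 - threshold))
```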
Cross-Clip Identity Connectivity
Avatar video synthesis demands paired clips depicting the same individual across visually distinct contexts, enabling the model to disentangle identity from background, lighting, and pose. Two clips are linked if they depict the same individual (verified by high face similarity) in visually distinct scenes (verified by low background similarity), with sufficient duration for learning dynamic features.
The resulting connectivity graph enables efficient sampling of cross-scene reference pairs during training, organized into resolution-duration groups with balanced demographic representation. This curated data, together with the annotation signals described above, feeds directly into the progressive training pipeline.
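The linking rule can be sketched directly from its definition: same person (high face-embedding similarity), different scene (low background-embedding similarity), and enough footage on both sides. Field names and thresholds below are illustrative:

```python
import itertools

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5)
                  * (sum(y * y for y in b) ** 0.5))

def identity_edges(clips, face_thr=0.8, bg_thr=0.3, min_dur=4.0):
    """Return edges of the connectivity graph: pairs depicting the
    same individual in visually distinct scenes, with sufficient
    duration to learn dynamic features from each clip."""
    edges = []
    for a, b in itertools.combinations(clips, 2):
        if (a["dur"] >= min_dur and b["dur"] >= min_dur
                and cosine(a["face"], b["face"]) >= face_thr
                and cosine(a["bg"], b["bg"]) <= bg_thr):
            edges.append((a["id"], b["id"]))
    return edges
```

During training, sampling a reference pair is then a walk along these edges, which forces the model to carry identity across a scene change.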
Why This Matters
The data engine runs continuously, processing new sources as they become available and reprocessing existing data when models improve. It is not a one-time ETL job but a living system that evolves alongside the models it feeds. The pipeline’s scale—50M raw videos distilled into 100M+ pretraining clips and 10M+ fine-tuning clips, each annotated with dozens of quality signals and linked by identity—is what enables Avatar V to generalize across the enormous diversity of human appearance, motion, and expression.