Research

Technical reports · Apr 8, 2026

Avatar V: Scaling Video-Reference Avatar Generation

Avatar V is built on a Diffusion Transformer with flow matching that conditions directly on the full token sequence of a user’s reference video—no bottleneck embeddings. Sparse Reference Attention keeps cost almost linear with reference length. A five-stage training curriculum progresses from general video pre-training through identity-preserving fine-tuning, distillation, and RLHF alignment.
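The report does not spell out how Sparse Reference Attention selects tokens, but the "almost linear" cost suggests each query attends to a small, fixed-size subset of reference tokens rather than all of them. The sketch below is a minimal NumPy illustration of that idea under an assumed top-k selection rule; the function name, shapes, and `top_k` parameter are illustrative, not Avatar V's actual implementation.

```python
import numpy as np

def sparse_reference_attention(q, ref_k, ref_v, top_k=4):
    """Each query attends only to its top_k highest-scoring reference
    tokens, so per-query cost is O(top_k) after scoring instead of
    mixing over the full reference sequence.
    Shapes: q (Tq, d), ref_k and ref_v (Tr, d)."""
    scores = q @ ref_k.T / np.sqrt(q.shape[-1])        # (Tq, Tr)
    # keep the top_k reference tokens per query, drop the rest
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    kept = np.take_along_axis(scores, idx, axis=-1)    # (Tq, top_k)
    w = np.exp(kept - kept.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept tokens
    return np.einsum('qk,qkd->qd', w, ref_v[idx])      # gather values, mix
```

With `top_k` equal to the full reference length this reduces to dense attention; shrinking `top_k` trades a little fidelity for cost that no longer grows with how many reference tokens each query actually mixes.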

Technical reports · Apr 3, 2026

Curating Millions of Videos: The Data Engine Behind Avatar V

A distributed data engine orchestrating 25+ processing stages and 20+ specialized AI models transforms 50M raw videos into 100M+ pretraining clips and 10M+ avatar fine-tuning clips. A 10-stage segment-level curation cascade, 13 parallel feature extraction stages, 10 fine-grained avatar quality signals, and a cross-clip identity connectivity graph produce the training data that makes Avatar V possible.
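The core pattern behind a curation cascade is ordered filtering: cheap predicates prune the pool before expensive model-based stages run, and per-stage pass counts show where data is lost. A minimal sketch of that pattern, with hypothetical stage names and clip fields (the real engine's 25+ stages and models are not reproduced here):

```python
def run_cascade(clips, stages):
    """Apply each (name, predicate) stage in order; a clip must pass
    every stage to survive. Returns survivors plus per-stage
    (before, after) counts for funnel diagnostics."""
    stats = {}
    for name, keep in stages:
        before = len(clips)
        clips = [c for c in clips if keep(c)]
        stats[name] = (before, len(clips))
    return clips, stats

# Toy stages, ordered cheap-to-expensive; real stages would wrap
# specialized models (face detection, quality scoring, etc.).
stages = [
    ("duration",   lambda c: c["dur"] >= 2),
    ("resolution", lambda c: c["res"] >= 720),
]
```

Ordering matters: putting the cheapest, highest-rejection filters first minimizes how many clips ever reach the GPU-heavy stages.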

Technical reports · Apr 2, 2026

From Model to Production: Optimizing Avatar V Inference at Scale

Avatar V generates 1080p video at 30 fps across 8 GPUs per request. A custom compiler with LLM-based agentic kernel synthesis achieves 3× latency reduction over the unoptimized baseline and 33% improvement over torch.compile. Chunk-based autoregressive generation enables arbitrary-length output, while NVSHMEM-based sequence parallelism, two-level context caching, and streaming VAE decode keep memory bounded and throughput high.
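Chunk-based autoregressive generation with bounded memory comes down to conditioning each chunk on a fixed-size window of previously generated chunks. The sketch below shows only that control flow, with a toy `step` callback standing in for the model; the cache size, chunk shape, and function names are assumptions, not Avatar V's actual inference stack.

```python
from collections import deque

def generate_stream(num_chunks, chunk_len, step, cache_chunks=2):
    """Yield chunks one at a time. Each chunk is conditioned only on
    the last `cache_chunks` chunks, so memory stays bounded no matter
    how long the total output grows."""
    cache = deque(maxlen=cache_chunks)   # old chunks evicted automatically
    for i in range(num_chunks):
        context = [f for chunk in cache for f in chunk]
        chunk = [step(context, i * chunk_len + t) for t in range(chunk_len)]
        cache.append(chunk)
        yield chunk                      # stream out, e.g. into a VAE decoder
```

Because chunks are yielded as they finish, a downstream streaming decoder can start producing frames before the full sequence exists, which is what makes arbitrary-length output practical.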

Technical reports · Apr 1, 2026

HELIOS: Unified GPU Infrastructure for Training, Inference, and Data at Scale

HELIOS is a unified GPU infrastructure platform managing 5,000+ GPUs across 5+ cloud providers and 15+ standardized cells. A two-stage QoS-aware scheduler improved GPU utilization by 15% and reduced non-productive GPU time by 20%. A custom declarative data processing engine replaced Ray, scaling to 200K+ concurrent tasks with 95%+ GPU utilization and node failure detection under 30 seconds.
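A two-stage scheduler typically separates *ordering* (which job goes next, driven by QoS class) from *placement* (which node it lands on, driven by packing efficiency). The following is a minimal sketch of that split under assumed QoS classes and a best-fit placement rule; HELIOS's actual policies are not described at this level of detail in the summary.

```python
def schedule(jobs, free_gpus, qos_order=("prod", "batch", "best_effort")):
    """Stage 1: rank jobs by QoS class, then by size (largest first).
    Stage 2: place each job on the feasible node with the fewest free
    GPUs remaining (best-fit), which reduces fragmentation.
    `free_gpus` maps node name -> available GPU count (mutated)."""
    placement = {}
    ranked = sorted(jobs, key=lambda j: (qos_order.index(j["qos"]), -j["gpus"]))
    for job in ranked:
        fits = [n for n in free_gpus if free_gpus[n] >= job["gpus"]]
        if not fits:
            continue                      # job stays queued for the next cycle
        node = min(fits, key=lambda n: free_gpus[n])
        free_gpus[node] -= job["gpus"]
        placement[job["name"]] = node
    return placement
```

Best-fit placement keeps large contiguous blocks of GPUs free on lightly loaded nodes, so big training jobs are less likely to starve behind many small ones.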

Technical reports · Mar 3, 2026

TAVR: Generate Your Talking Avatar from Video Reference

TAVR replaces single-image avatar references with short video clips, enabling cross-scene generation with significantly better identity preservation. A three-stage training strategy bridges the domain gap between reference and target scenes. On a new cross-scene benchmark, TAVR yields the best identity similarity and achieves an overall quality score of 16.42 vs 14.13 for the next best method.
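Identity similarity on a cross-scene benchmark is commonly computed as mean cosine similarity between face embeddings of generated frames and reference frames. The benchmark's exact metric isn't specified in the summary, so the snippet below is only a generic sketch of that standard measure, with hypothetical inputs.

```python
import numpy as np

def identity_similarity(gen_embs, ref_embs):
    """Mean pairwise cosine similarity between generated-frame and
    reference-frame face embeddings (rows = frames). Higher means
    better identity preservation."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    return float(np.mean(unit(gen_embs) @ unit(ref_embs).T))
```

In practice the embeddings would come from a pretrained face-recognition model applied to detected face crops; this function only aggregates them.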