
From Model to Production: Optimizing Avatar V Inference at Scale

Avatar V generates 1080p video at 30 fps across 8 GPUs per request. A custom compiler with LLM-based agentic kernel synthesis achieves 3× latency reduction over the unoptimized baseline and 33% improvement over torch.compile. Chunk-based autoregressive generation enables arbitrary-length output, while NVSHMEM-based sequence parallelism, two-level context caching, and streaming VAE decode keep memory bounded and throughput high.

By HeyGen Research, Apr 2, 2026

Avatar V generates videos of arbitrary length at 1080p, 30 fps. Making that work in production—serving thousands of concurrent requests across thousands of GPUs—required rethinking the inference stack from the ground up. All inference uses the distilled model, which internalizes classifier-free guidance and operates in 24 denoising steps after two-phase distillation, eliminating the need for separate conditional and unconditional forward passes.

End-to-End Pipeline

Inference proceeds in four stages:

  • Preprocessing. The user’s reference video is encoded once into video reference tokens, identity embeddings, and expression embeddings. Audio features are extracted from the target audio track. Scene prompts are encoded into text embeddings. The Identity-Preserving Image Engine generates a scene image conditioned on the reference identity. These steps run in parallel.
  • DiT generation. The base-resolution DiT performs chunk-based autoregressive generation, conditioning on all preprocessed signals through Sparse Reference Attention.
  • Super-resolution. An identity-aware SR refiner upscales the output to high resolution in a single denoising step.
  • Streaming decode. A streaming VAE decoder converts latents to pixels incrementally, producing output frames before the full video is complete.
[Figure: flowchart of the generation system. The user's idea and uploaded identity video pass through the Video Agent, Audio/Image Engine, and Prompt Enhancer into Avatar V (VideoRef DiT and SR Refiner) to produce the final video.]

Figure 1: Avatar V inference pipeline. The user’s identity video is processed once into a reusable personality embedding. Scene image generation and prompt engineering proceed in parallel, then all signals are combined for low-resolution DiT generation followed by identity-aware super-resolution.

Chunk-Based Long-Form Generation

Generating a 10-minute video in a single forward pass is impractical. Memory and compute costs grow with sequence length, and a single failure would require regenerating everything. Avatar V splits long-form generation into chunks, each producing 41 latent frames—approximately 6.4 seconds of video at 25 fps.

The first chunk operates in ref2v mode, where the reference video frame is encoded as the identity conditioning signal. Subsequent chunks operate in prefix2v mode: the last frames of the previous chunk serve as the prefix condition for the next, providing a smooth temporal bridge. Adjacent chunks share a 2-frame overlap to ensure seamless transitions. A global appearance anchor extracted from the first chunk, combined with motion frame propagation, ensures identity consistency across arbitrarily long videos.
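The chunk layout above can be sketched as a small scheduling helper. This is an illustrative sketch, not HeyGen's code: `plan_chunks` and its dict layout are invented names, and it assumes the 2-frame overlap applies between every pair of adjacent chunks.

```python
def plan_chunks(total_latent_frames, chunk_len=41, overlap=2):
    """Split a long latent sequence into overlapping chunks.

    The first chunk runs in ref2v mode; every later chunk runs in
    prefix2v mode, reusing its `overlap` leading frames from the
    previous chunk's tail. Names and layout are illustrative.
    """
    chunks, start = [], 0
    while start < total_latent_frames:
        end = min(start + chunk_len, total_latent_frames)
        mode = "ref2v" if start == 0 else "prefix2v"
        chunks.append({"mode": mode, "start": start, "end": end})
        if end == total_latent_frames:
            break
        start = end - overlap  # adjacent chunks share `overlap` frames
    return chunks
```

For an 80-frame latent sequence this yields one ref2v chunk covering frames 0–41 and one prefix2v chunk starting at frame 39, sharing the 2-frame bridge.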

Diffusion Sampling

Deterministic ODE-based samplers for flow matching models struggle with high-frequency details at reduced step counts—unstable hand clarity, blurred teeth, and temporally inconsistent fine textures. Avatar V adopts an improved stochastic Euler sampler: at each step, the sample is advanced beyond the target noise level by a controlled overshoot factor, then stochastically renoised back to the correct level with fresh Gaussian noise. This controlled stochasticity improves detail recovery in high-frequency regions and prevents the accumulation of discretization errors, enabling high-quality generation in 24 steps with stable hand, teeth, and facial detail quality.
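A minimal sketch of such an overshoot-and-renoise step, assuming the linear flow-matching path x_s = (1-s)·x0 + s·eps. The overshoot schedule and renoising coefficients here are illustrative guesses, not the production sampler.

```python
import math
import random


def overshoot_euler_step(x, sigma, sigma_next, velocity, c=0.2, rng=random):
    """One stochastic Euler step with controlled overshoot.

    Advances past sigma_next by an overshoot factor c, then renoises
    back up to sigma_next with fresh Gaussian noise so the marginal
    noise level is restored. Assumes the path x = (1-s)*x0 + s*eps;
    the renoising coefficients below follow from that assumption.
    """
    v = velocity(x, sigma)
    # Overshoot: step below the target noise level.
    sigma_over = max(0.0, sigma_next - c * (sigma - sigma_next))
    x = [xi + (sigma_over - sigma) * vi for xi, vi in zip(x, v)]
    if sigma_over < sigma_next:
        # Renoise back to sigma_next with fresh Gaussian noise.
        scale = (1 - sigma_next) / (1 - sigma_over)
        carried = sigma_over * scale  # std of carried-over old noise
        fresh = math.sqrt(max(0.0, sigma_next**2 - carried**2))
        x = [scale * xi + fresh * rng.gauss(0, 1) for xi in x]
    return x
```

With c = 0 this degenerates to a plain deterministic Euler step; the fresh noise injected at c > 0 is what prevents discretization errors from accumulating in high-frequency regions.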

VideoRef Context Caching and Sparse Attention

Video reference tokens, encoded from clean reference frames, remain invariant across denoising steps. Avatar V exploits this through a two-level caching strategy, complemented by sparse validity masking:

  • Context-level caching. At the first denoising step, the full video reference context—latents, audio features, validity masks, expression and identity embeddings—is computed and cached. For all subsequent steps, this cached context is reused without recomputation, avoiding the cost of re-encoding the reference video at each of the 24 denoising steps.
  • Attention-level KV caching. Within each transformer block’s reference self-attention layer, the key and value projections of video reference tokens are computed once at the first step and cached in GPU memory. Subsequent steps directly concatenate the cached reference KV tensors with freshly computed generation token KV tensors, eliminating redundant linear projections and RoPE computations.
  • Sparse validity masking. Video reference sequences can contain invalid tokens—frames where the face is not visible. Sparse attention masks skip computation for these tokens entirely. The validity masks are precomputed per-rank for the sequence-parallel layout, avoiding wasted computation on non-informative reference tokens.
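The attention-level cache can be sketched as follows. `ReferenceKVCache` and `project_kv` are hypothetical names, and plain numbers stand in for tensors; the point is that the reference-token projections are computed exactly once across all denoising steps.

```python
class ReferenceKVCache:
    """Per-layer KV cache for invariant video-reference tokens.

    Reference tokens come from clean frames, so their K/V projections
    do not change across denoising steps: compute them at step 0 and
    reuse them afterwards. Generation-token KV is recomputed each step
    and concatenated with the cached reference KV.
    """

    def __init__(self):
        self._cache = {}  # layer index -> (ref_k, ref_v)

    def kv_for_layer(self, layer, ref_tokens, gen_tokens, project_kv):
        if layer not in self._cache:           # first denoising step only
            self._cache[layer] = project_kv(ref_tokens)
        ref_k, ref_v = self._cache[layer]
        gen_k, gen_v = project_kv(gen_tokens)  # recomputed every step
        return ref_k + gen_k, ref_v + gen_v    # concat along sequence dim
```

Over 24 steps this turns 24 reference projections per layer into one, which is where the savings on linear projections and RoPE come from.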

Distributed Inference with Sequence Parallelism

Avatar V distributes inference across 8 GPUs within a single node using Ulysses Sequence Parallelism (USP). The input sequence—video latents, reference tokens, high-resolution face tokens, and conditioning tokens—is partitioned along the sequence dimension across GPU ranks, with all-to-all communication for attention operations that require cross-rank token interaction.
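The core USP re-partition can be shown as a pure-Python stand-in: before attention, each rank trades its sequence shard of every head for the full sequence of a head subset, so attention runs entirely locally. Lists replace tensors, and the function replaces the real NCCL/NVSHMEM collective.

```python
def ulysses_all_to_all(shards, num_ranks, num_heads):
    """Simulate the Ulysses all-to-all before attention.

    Input: shards[r] holds rank r's sequence shard for all heads, as
    {head: [tokens]}. Output: each rank holds the FULL sequence for
    its subset of heads. Pure-Python stand-in for the real collective.
    """
    heads_per_rank = num_heads // num_ranks
    out = []
    for r in range(num_ranks):
        my_heads = range(r * heads_per_rank, (r + 1) * heads_per_rank)
        out.append({
            h: [tok for shard in shards for tok in shard[h]]  # gather seq
            for h in my_heads
        })
    return out
```

A mirrored all-to-all after attention restores the sequence-sharded layout for the feed-forward layers.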

FSDP2 with CPU Offloading

Model parameters are sharded across GPUs using FSDP2 with CPU offloading for inactive parameter shards. This frees GPU memory for the large intermediate activations required by the DiT and enables multi-model co-location: multiple model variants can reside on a single machine with rapid switching via CPU-to-GPU loading rather than full reloading from disk. Forward prefetching overlaps the AllGather of the next block’s parameters with the current block’s computation, hiding communication latency. Processes are pinned to the NUMA node of their assigned GPU for optimal memory bandwidth.
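Forward prefetching can be sketched with a background thread standing in for the asynchronous AllGather / CPU-to-GPU copy; `fetch_params` and `compute` are hypothetical stand-ins, and the queue models the single in-flight prefetch.

```python
import queue
import threading


def run_blocks(blocks, fetch_params, compute):
    """Overlap parameter prefetch with block compute, FSDP-style.

    While block i runs, a background thread fetches block i+1's
    parameter shards, so the transfer latency hides behind compute.
    """
    results = []
    prefetched = queue.Queue(maxsize=1)

    def prefetch(i):
        prefetched.put(fetch_params(blocks[i]))

    threading.Thread(target=prefetch, args=(0,)).start()
    for i in range(len(blocks)):
        params = prefetched.get()        # wait for this block's params
        if i + 1 < len(blocks):          # kick off the next block's fetch
            threading.Thread(target=prefetch, args=(i + 1,)).start()
        results.append(compute(blocks[i], params))
    return results
```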

Inference Acceleration

Deploying Avatar V at production scale surfaces three latency bottlenecks: kernel launch overhead and memory bandwidth waste from thousands of small operators per transformer block, coarse-grained inter-GPU synchronization in sequence-parallel attention, and hardware-level frequency variance across GPU ranks causing straggler effects at collective boundaries.

Custom Compiler with Agentic Kernel Synthesis

Standard torch.compile with the Inductor backend is insufficient for production diffusion inference at this scale. Inductor’s pattern matcher misses many cross-operator fusion opportunities, its generated Triton kernels are suboptimal for the specific tensor shapes in the model, and it handles dynamic shapes poorly—relying on guard-and-recompile strategies that trigger excessive recompilation.

Avatar V introduces a compiler workflow that combines human expertise with LLM-based kernel generation. Engineers profile the forward pass and define fusion scopes—which operator subgraphs should be merged into single kernel launches. An LLM-based agent then takes each fusion specification and generates CUDA/Triton kernel candidates through an iterative evolution process.

A key challenge is that kernel-level profiling is inherently noisy due to GPU thermal state, memory allocator behavior, and scheduling variance. To mitigate this, the system uses an evolution island strategy: 2–3 islands run in parallel, each exploring different tiling strategies and memory access patterns across 4 candidates per generation. Fitness is evaluated on both kernel latency and numerical accuracy, and the best candidate is selected across islands after a fixed iteration budget.
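The island strategy can be sketched as follows, with a single scalar fitness standing in for the combined latency-and-accuracy score. Population sizes, the elitist update, and all names here are illustrative, not the production search.

```python
import random


def evolve_kernels(seed_candidates, mutate, fitness, islands=3,
                   pop=4, generations=5, rng=random):
    """Island-style evolutionary search over kernel candidates.

    Each island evolves its own population independently (different
    tiling and memory-access strategies in the real system); the best
    candidate across all islands wins after a fixed iteration budget.
    Lower fitness is better.
    """
    best = None
    for _ in range(islands):
        population = [mutate(rng.choice(seed_candidates)) for _ in range(pop)]
        for _ in range(generations):
            scored = sorted(population, key=fitness)
            elite = scored[: pop // 2]                 # keep the best half
            population = elite + [mutate(rng.choice(elite))
                                  for _ in range(pop - len(elite))]
        island_best = min(population, key=fitness)
        if best is None or fitness(island_best) < fitness(best):
            best = island_best
    return best
```

Running independent islands is what makes the search robust to noisy profiling: a lucky or unlucky timing on one island does not steer the whole population.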

The agentic workflow produces mega kernels that fuse entire non-attention portions of each transformer block into single kernel launches, eliminating intermediate tensor materializations. The compiled forward pass reduces from thousands of small kernels to only Flash Attention calls, cuBLAS GEMMs, and a handful of fused mega kernels—achieving 3× latency reduction over the unoptimized baseline and 33% improvement over torch.compile Inductor.

NVSHMEM-Based Sequence Parallelism

Standard NCCL-based all-to-all communication operates at kernel-level granularity: each all-to-all must fully complete before downstream compute begins. Avatar V replaces this with NVSHMEM-based communication that exploits NVLink for direct GPU-to-GPU data movement with tile-level dataflow control.

With NVSHMEM, individual data tiles can be sent, received, and synchronized within a single fused kernel, enabling sub-tensor pipelining: the all-to-all scatter, cuBLAS GEMM computation, and all-to-all gather overlap at fine granularity rather than executing sequentially. As each tile arrives from a remote rank, it is immediately available for computation without waiting for the full transfer to complete.
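The tile-level overlap can be illustrated with a producer thread standing in for NVSHMEM puts over NVLink and a consumer that computes on each tile as it lands. This is a CPU-side analogy of the dataflow, not NVSHMEM code.

```python
import queue
import threading


def pipelined_all_to_all(tiles, compute_tile):
    """Tile-level pipelining of communication and compute.

    The producer 'sends' tiles one at a time; the consumer computes on
    each tile as soon as it arrives, instead of waiting for the whole
    tensor as a kernel-granular NCCL all-to-all would.
    """
    arrived = queue.Queue()
    DONE = object()

    def producer():
        for tile in tiles:       # fine-grained transfers
            arrived.put(tile)
        arrived.put(DONE)

    threading.Thread(target=producer).start()
    results = []
    while (tile := arrived.get()) is not DONE:
        results.append(compute_tile(tile))  # overlaps with later transfers
    return results
```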

System-Level Optimization

NUMA-aware process placement. Each inference rank is pinned to the CPU cores and memory controllers on the same NUMA node as its assigned GPU, ensuring optimal bandwidth for CPU-GPU parameter transfers required by FSDP2 offloading.

GPU clock locking. In distributed inference where all ranks synchronize at collective operations, the slowest rank determines overall latency. Default GPU boost clocking allows frequency to vary across GPUs depending on thermal and power state, creating straggler ranks that gate every synchronization point. Locking all GPUs to a stable frequency below the boost ceiling eliminates this variance and reduces overall latency by approximately 3%.

Super-Resolution

The base DiT generates video at low resolution in latent space. A single-step adversarial SR model upscales to 1080p, focusing computational resources on high-detail regions—particularly the mouth area for lip-sync fidelity. The refiner inherits the same identity conditioning as the base model through the context cache. Low-resolution latents are noised at σ = 0.6 before the SR step, providing enough room for detail enhancement while preserving structural content from the base generation.
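Assuming the usual flow-matching mixing x = (1-σ)·x0 + σ·ε, the renoising step before the SR pass might look like this sketch; the exact convention used in production is not stated in the text.

```python
import random


def renoise_for_sr(latent, sigma=0.6, rng=random):
    """Renoise base-generation latents before the one-step SR pass.

    Assumes the convention x = (1-sigma)*x0 + sigma*eps, so sigma=0.6
    keeps 40% of the structural signal while leaving the refiner
    headroom to synthesize high-frequency detail. The mixing
    convention is an assumption, not confirmed by the source.
    """
    return [(1 - sigma) * x + sigma * rng.gauss(0, 1) for x in latent]
```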

Streaming VAE Decode

The standard approach decodes the full latent video to pixels in one pass, requiring enough memory to hold every frame simultaneously. Avatar V replaces this with a streaming VAE decoder that uses causal 3D convolutions with temporal feature caching, enabling chunk-by-chunk decoding. Decoded frames are piped directly into an asynchronous streaming video encoder that writes the output file incrementally. Peak memory stays bounded regardless of video length, and the first frames are available before the full video has been decoded.
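The cache-carrying decode pattern can be illustrated in one dimension: a causal 1-D filter stands in for the VAE's causal 3-D convolutions, and all names are invented. The invariant is that chunked decoding produces exactly the same frames as a single full-length pass, while only ever holding one chunk plus a small cache.

```python
def streaming_decode(latent_chunks, kernel=(0.25, 0.5, 0.25)):
    """Chunk-by-chunk causal temporal filtering with a feature cache.

    Each chunk is decoded using only past context, carried between
    chunks as a small cache of trailing features, so peak memory is
    bounded by the chunk size rather than the video length.
    """
    k = len(kernel)
    cache = [0.0] * (k - 1)          # trailing features from prior chunks
    for chunk in latent_chunks:
        padded = cache + list(chunk)
        out = [sum(kernel[j] * padded[i + j] for j in range(k))
               for i in range(len(chunk))]
        cache = padded[-(k - 1):]    # keep only what the next chunk needs
        yield out                    # frames stream out before decode ends
```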

Results

[Table: seven optimizations and their impact on performance and memory management.]

Table 1: Key inference optimizations and their production impact.