HeyGen

Avatar video only feels convincing if the person remains consistent over time. When the face drifts, the teeth change, the lip-sync slips, or the motion resets between clips, people notice straightaway. This is even more important for avatars than for many other video generation tasks because the viewer is watching a specific person speak, often from close range, for an extended period.

In today’s video generation landscape, duration is still one of the most visible limitations. Many models and products generate fixed-length clips of only a few seconds, and very few systems can generate more than a few minutes. For avatar products, that limitation shows up directly in customer workflows. Customers want longer, consistent videos for training programmes, sales demos, product walkthroughs, education, support, and virtual agents that should keep talking until the task is completed. They also want fast previews so they can iterate quickly on prompts, motion, and script.

At HeyGen, this led to three clear requirements:

Long-scene consistency. The avatar needs to preserve identity, lip sync, expression, and motion continuity not just for one short clip, but across many segments of generated video.
No fixed duration limit. A generation might be ten seconds, ten minutes, or an open-ended real-time session.
Fast preview, with real-time or faster-than-real-time generation. The system should start producing frames quickly and be able to stream them out while inference is still in progress.

This post explains the inference framework we have built to fulfil those requirements.

The Underlying Model Architecture

The framework is built around HeyGen's avatar video generation models — the Avatar IV and Avatar V families. At a high level, the model takes a reference image or video, driving audio, and optional text or scene conditioning, then generates a video of that avatar speaking with the correct identity, expression, and motion.

The core generation model is a Diffusion Transformer, or DiT, trained with flow matching. Instead of compressing the person into a small identity embedding, the model conditions on rich reference tokens so it can preserve details that matter for avatars: face shape, teeth, skin texture, mouth movement, gesture style, and speaking rhythm.

The production inference path has three main stages:

Audio-to-video generation. A base DiT generates low-resolution video latents from the reference identity, audio features, and conditioning signals. This stage focuses on motion, lip sync, and temporal coherence.
Identity-aware super-resolution. A second model refines those latents into high-resolution output, with extra attention on regions where people are most sensitive to artifacts, especially the face and mouth.
Streaming VAE decode. A VAE decoder converts high-resolution latents into RGB frames, chunk by chunk, so frames can be generated before the full video is complete.

To generate long videos, the system processes data in chunks. While the first chunk relies entirely on the static reference, subsequent chunks use boundary data from preceding segments. This allows the avatar to continue speaking naturally without resetting its posture or identity from scratch.

The Streaming Framework and Pipeline Loop

To support chunk-based execution, the inference framework uses a modular, three-tier architecture that operates on localised windows of time, releasing resources immediately after each chunk is processed.

Module: A wrapper around a specific model and its checkpoint (for example, A2V DiT, Super-Resolution DiT, VAE components, text/audio encoders).
Stage: A typed execution unit that coordinates one or more modules (for example, context generation, super-resolution).
Pipeline: The execution graph that connects stages, manages shared state, and coordinates streaming or batch execution modes.
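The three tiers can be pictured with a minimal sketch. This is an illustration under assumptions, not the production framework: the class fields, the `run_chunk` method, and the toy lambda "models" are all hypothetical stand-ins for real model wrappers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Module:
    """Wraps one model and its checkpoint (here just a callable stand-in)."""
    name: str
    forward: Callable[[Any], Any]


@dataclass
class Stage:
    """Execution unit that coordinates one or more modules, run in order."""
    name: str
    modules: List[Module]

    def run(self, x: Any) -> Any:
        for module in self.modules:
            x = module.forward(x)
        return x


@dataclass
class Pipeline:
    """Connects stages and holds state shared across chunks."""
    stages: List[Stage]
    shared_state: Dict[str, Any] = field(default_factory=dict)

    def run_chunk(self, chunk: Any) -> Any:
        for stage in self.stages:
            chunk = stage.run(chunk)
        done = self.shared_state.get("chunks_done", 0)
        self.shared_state["chunks_done"] = done + 1
        return chunk


# Toy stand-ins: real modules would wrap the A2V DiT, SR DiT, VAE, and encoders.
a2v = Stage("audio_to_video", [Module("a2v_dit", lambda x: x + ["a2v"])])
sr = Stage("super_resolution", [Module("sr_dit", lambda x: x + ["sr"])])
pipeline = Pipeline([a2v, sr])
result = pipeline.run_chunk(["chunk0"])
```

Keeping the model wrapper, the execution unit, and the graph as separate types means a stage can be swapped or reordered without touching how a checkpoint is loaded.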

The initialisation phase encodes the reference identity into latents once per request. The pipeline then runs a continuous loop across the remaining stages until the input audio stream is fully consumed:

Context Generation: Converts incoming audio segments into features, combines them with text or scene conditioning, and prepares the target noise tensors.
Audio-to-Video: Carries out a multi-step diffusion pass to generate low-resolution latents. At this stage, the current chunk is conditioned on the boundary frames of the previous chunk to preserve smooth motion.
Super-Resolution: Upscales the motion latents to full resolution in a single step, prioritising spatial detail on the face.
VAE Decode-and-Publish: Decodes the high-resolution latents into RGB frames and writes them directly to the output encoder (H.264 / AAC) for instant storage or live playback.
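The loop above can be sketched as a Python generator. This is a simplified illustration, not the real implementation: every function name is hypothetical, and strings stand in for tensors so that the data flow is easy to follow.

```python
from typing import Iterable, Iterator, Optional


def encode_reference(reference: str) -> str:
    # Initialisation: runs once per request.
    return f"ref({reference})"


def context_generation(audio_segment: str) -> str:
    # Audio features plus conditioning for one chunk.
    return f"ctx({audio_segment})"


def audio_to_video(ref: str, ctx: str, boundary: Optional[str]) -> str:
    # The first chunk has no boundary and relies on the reference alone.
    return f"lowres({ref},{ctx},prev={boundary})"


def super_resolution(latents: str) -> str:
    return f"hires({latents})"


def vae_decode(latents: str) -> str:
    return f"rgb({latents})"


def run_stream(reference: str, audio_segments: Iterable[str]) -> Iterator[str]:
    """Yield decoded output chunk by chunk until the audio is consumed."""
    ref = encode_reference(reference)
    boundary: Optional[str] = None
    for segment in audio_segments:
        ctx = context_generation(segment)
        low = audio_to_video(ref, ctx, boundary)
        # Carry only boundary data forward to condition the next chunk.
        boundary = f"tail({segment})"
        yield vae_decode(super_resolution(low))


frames = list(run_stream("portrait.png", ["a0", "a1"]))
```

Because `run_stream` is a generator, a consumer can publish each chunk as soon as it is yielded instead of waiting for the whole video.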

Boundary Continuity and Chunk Consistency

Generating video in separate segments can introduce discontinuities at chunk boundaries. The framework addresses this with two chunk types:

N Chunks: Segments that generate the main timeline of the avatar.
I Chunks (Interpolation): Segments intended to smooth transitions between sequential N chunks.

The order of execution is organised in the following way:

N0 -> N1 -> I0 -> N2 -> I1 -> N3 -> I2 -> ...

An I chunk is generated only after both the N chunk before it and the N chunk after it are complete. It uses the final frame of the preceding N chunk and an early frame of the following N chunk as anchor frames to compute the transitional motion. After generation, the redundant anchor predictions are discarded, leaving only the smoothly interpolated transition. This mechanism limits the required context window while preserving temporal consistency.
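The execution order can be reproduced with a few lines of code. The sketch below assumes, as the sequence above suggests, that interpolation chunk I(k-1) bridges N(k-1) and N(k). The function name `chunk_schedule` is hypothetical.

```python
from typing import Iterator


def chunk_schedule(num_n_chunks: int) -> Iterator[str]:
    """Yield chunk labels in execution order.

    I(k-1) is emitted right after N(k), so both of its neighbouring
    N chunks are already complete when it runs.
    """
    for k in range(num_n_chunks):
        yield f"N{k}"
        if k >= 1:
            yield f"I{k - 1}"


order = list(chunk_schedule(4))
print(" -> ".join(order))
```

For four N chunks this prints `N0 -> N1 -> I0 -> N2 -> I1 -> N3 -> I2`, which matches the order shown above.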

Constant memory throughout the duration

A conventional video pipeline accumulates latents, decoded frames, and attention context during execution, causing GPU memory usage to increase in direct proportion to the video duration.

To support open-ended generation, this framework maintains a strict rolling state. The system keeps only the static reference conditioning and a minimal set of anchor tensors needed for chunk transitions. All intermediate assets—including audio features, noise tensors, internal activations, and raw RGB frames—are cleared from memory immediately after a chunk is decoded and written.

As a result, the peak GPU memory profile remains constant whether you are generating a short clip or an extended sequence; resource utilisation scales with the defined chunk size rather than the total duration of the session.
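A toy model makes the rolling-state argument concrete. It is only a sketch under assumptions: buffers are counted rather than measured in bytes, a single anchor is carried between chunks, and the names `run_rolling` and `frames_per_chunk` are hypothetical.

```python
from typing import Dict, Optional


def run_rolling(num_chunks: int, frames_per_chunk: int) -> int:
    """Return the peak number of frame-sized buffers held at any one time."""
    # Long-lived state: static reference conditioning plus one anchor.
    state: Dict[str, Optional[str]] = {"reference": "ref", "anchor": None}
    peak = 0
    for i in range(num_chunks):
        # Intermediates exist only for the duration of this iteration.
        frames = [f"frame{i}_{j}" for j in range(frames_per_chunk)]
        held = len(frames) + (1 if state["anchor"] is not None else 0)
        peak = max(peak, held)
        # Keep just the boundary frame, then drop everything else.
        state["anchor"] = frames[-1]
        del frames
    return peak


short_peak = run_rolling(num_chunks=3, frames_per_chunk=16)
long_peak = run_rolling(num_chunks=3000, frames_per_chunk=16)
```

Both runs report the same peak (one chunk of frames plus the carried anchor), so in this model the footprint depends on `frames_per_chunk` and not on the number of chunks.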

Loading and offloading stages within the pipeline

Each request runs on an 8-GPU node. We use FSDP to shard large model parameters across GPUs. Each rank owns only a fraction of the weights, gathers the parameters it needs for a computation, and then releases them again. This is what allows multiple large models — the base DiT, the super-resolution DiT, the text encoder, the audio encoder, and the VAE — to fit on a single node.

There is a trade-off. FSDP introduces communication overhead during inference because parameters need to be gathered during forward passes. We use a combination of techniques to mask that overhead and to keep co-located models off the GPU when they are not in use:

Forward prefetching. The AllGather for the next block’s parameters is triggered in advance and overlapped with the current block’s computation, which helps to hide the gather latency on the critical path.
Lazy per-block unsharding from CPU. When a model is brought back from pinned CPU memory, we do not load the full set of weights in advance. Each transformer block is unsharded (host-to-device copy + AllGather) just before its forward pass, so the H2D transfer of block n+1 overlaps with the computation of block n.
Pinned CPU offload between stages. The parameters of a model that is not currently running are kept in pinned CPU memory, so co-located models (base DiT, super-resolution DiT, text encoder, audio encoder, VAE) do not all need to hold their weights on the GPU at the same time. Pinned memory is what makes the H2D copies fast enough to overlap with computation.
NUMA-aware process placement. Each process is pinned to the same NUMA node as its assigned GPU, so CPU↔GPU transfers run at full PCIe/NVLink bandwidth without crossing the inter-socket interconnect.
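To see why overlapping the gather or host-to-device copy of the next block with the current block's computation helps, consider a simple latency model. This is a back-of-the-envelope sketch, not a measurement: the per-block timings are made-up illustrative numbers, and the model assumes exactly one block is fetched ahead.

```python
from typing import Sequence


def serial_latency(load_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Every block is loaded, then computed, with no overlap."""
    return sum(load_ms) + sum(compute_ms)


def overlapped_latency(load_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Block i+1 is loaded while block i computes (one block of lookahead)."""
    # The first load has nothing to hide behind.
    total = load_ms[0]
    for i in range(len(compute_ms)):
        next_load = load_ms[i + 1] if i + 1 < len(load_ms) else 0.0
        # The step ends when both the compute and the prefetch are done.
        total += max(compute_ms[i], next_load)
    return total


# Hypothetical timings for four transformer blocks, in milliseconds.
load = [2.0, 2.0, 2.0, 2.0]
compute = [5.0, 5.0, 5.0, 5.0]
serial = serial_latency(load, compute)
overlapped = overlapped_latency(load, compute)
```

With these numbers the serial schedule takes 28 ms and the overlapped one 22 ms: only the first load stays on the critical path, as long as each load is shorter than the computation it hides behind.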

Model switching between stages in under 10 ms

The practical benefit of the techniques above is that handing the GPU from one stage’s model to the next (for example, A2V DiT → Super-Resolution DiT, or SR DiT → VAE decoder) is effectively free. Because the outgoing model is offloaded asynchronously and the incoming model’s first block is unsharded just in time, the H2D copy and AllGather are both hidden behind compute that is already running. End to end, the observable per-switch overhead is under 10 ms, comfortably below a single-frame budget at our target frame rates. In practical terms, this is what allows the streaming pipeline loop (Context Gen → A2V → SR → VAE Decode-and-Publish) to cycle through several large models per chunk without the model swap itself ever becoming the bottleneck.
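As a quick sanity check on the frame-budget comparison, the arithmetic looks like this. The frame rates below are assumptions chosen for illustration, since the target rates are not stated here. Only the 10 ms bound comes from the text.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given playback rate."""
    return 1000.0 / fps


SWITCH_OVERHEAD_MS = 10.0  # upper bound quoted in the text

# Hypothetical playback rates, for illustration only.
budgets = {fps: frame_budget_ms(fps) for fps in (24, 25, 30)}
for fps, budget in budgets.items():
    print(f"{fps} fps: {budget:.1f} ms per frame, "
          f"switch uses at most {SWITCH_OVERHEAD_MS / budget:.0%}")
```

At any of these rates a sub-10 ms switch fits inside a single frame interval.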

Real-time streaming publishing

To make the model fast enough for real-time streaming, we have implemented extensive inference optimisations; see https://www.heygen.com/research/avatar-v-inference for details.

Once the pipeline emits the video chunk by chunk in real time, streaming delivery becomes a natural extension of inference rather than a separate post-processing step.

For the broadcast-style real-time path, we publish generated frames to Amazon Kinesis Video Streams (KVS). KVS is usually discussed in the context of cameras, IoT devices, and uploaded media. In our case, the “camera” is the inference pipeline itself: frames are created by the model, encoded immediately, and pushed into KVS as a live stream.

The output writer receives decoded RGB frames from the streaming VAE and sends them into a GStreamer pipeline. The video is encoded as H.264 and the audio as AAC, and then both tracks are pushed into kvssink, the KVS producer sink. From there, viewers can play back the session as a live stream while it is still being generated.
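The shape of such a pipeline can be sketched as a GStreamer launch description built in Python. This is an assumption-laden illustration, not the production configuration: the helper `build_kvs_pipeline`, the element ordering, the encoder settings, and the `appsrc` names are hypothetical choices, and credential configuration is omitted. The stream name, region, resolution, and frame rate in the usage line are placeholders.

```python
def build_kvs_pipeline(stream_name: str, region: str,
                       width: int, height: int, fps: int,
                       sample_rate: int = 48000) -> str:
    """Return a gst-launch style description: RGB + PCM in, H.264 + AAC to KVS."""
    video = (
        # Decoded RGB frames are pushed into this appsrc by the output writer.
        "appsrc name=video_src is-live=true format=time "
        f'caps="video/x-raw,format=RGB,width={width},height={height},'
        f'framerate={fps}/1" '
        "! videoconvert ! x264enc tune=zerolatency bframes=0 "
        "! h264parse ! video/x-h264,stream-format=avc,alignment=au "
        f'! kvssink name=sink stream-name="{stream_name}" aws-region="{region}"'
    )
    audio = (
        # The driving audio is encoded to AAC and linked to the same sink.
        "appsrc name=audio_src is-live=true format=time "
        f'caps="audio/x-raw,format=S16LE,rate={sample_rate},channels=1,'
        'layout=interleaved" '
        "! audioconvert ! avenc_aac ! aacparse ! sink."
    )
    return f"{video} {audio}"


description = build_kvs_pipeline("example-avatar-stream", "us-east-1", 1280, 720, 25)
```

A string like this could be handed to GStreamer's launch parser, with frames and audio samples then pushed into the two `appsrc` elements as each chunk is decoded.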

Results and key learnings

The framework moved Avatar IV and Avatar V generation from fixed-scene rendering to open-ended streaming generation, removing scene-duration limits for both model families. For real-time Avatar IV generation at 720p, we now achieve a time to first frame of under 5 seconds and a generation speed of over 27 frames per second, which is faster than real-time playback.