HeyGen

Avatar video only feels believable if the person stays consistent over time. When the face drifts, the teeth change, the lip-sync slips, or the motion resets between clips, people notice straight away. This matters more for avatars than for many other AI video generation tasks because the viewer is watching a specific person speak, often at close range, for a long time.

Duration is still one of the most visible limitations in AI video generation. Many models and products expose generation as a fixed-length clip of a few seconds, and few systems can generate more than a few minutes. For avatar products, that limit shows up directly in customer workflows. Customers want longer, consistent videos for training, sales demos, product walkthroughs, education, support, and agents that keep talking until the task is done. They also want fast previews so they can iterate on prompts, motion, and script.

At HeyGen, that turned into three clear requirements:

Long-scene consistency. The avatar needs to preserve identity, lip-sync, expression and motion continuity not just for one short clip, but across many chunks of generated video.
No fixed duration limit. A generation might be ten seconds, ten minutes, or an open-ended real-time session.
Fast preview with realtime or faster-than-realtime generation. The system should start producing frames quickly and stream them out while inference is still running.

This post walks through the inference framework we built to meet those requirements.

The underlying model architecture

The framework is built around HeyGen's avatar video generation models — the Avatar IV and Avatar V families. At a high level, the model takes a reference image/video, driving audio, and optional text or scene conditioning, then generates a video of that avatar speaking with the right identity, expression, and motion.

The core generation model is a Diffusion Transformer, or DiT, trained with flow matching. Instead of compressing the person into a small identity embedding, the model conditions on rich reference tokens so it can preserve details that matter for avatars: face shape, teeth, skin texture, mouth movement, gesture style, and speaking rhythm.

The production inference path has three main stages:

Audio-to-video generation. A base DiT generates low-resolution video latents from the reference identity, audio features, and conditioning signals. This stage focuses on motion, lip-sync, and temporal coherence.
Identity-aware super-resolution. A second model refines those latents into high-resolution output, with extra attention on regions where people are most sensitive to artifacts, especially the face and mouth.
Streaming VAE decode. A VAE decoder converts high-resolution latents into RGB frames chunk by chunk, so frames can be emitted before the full video is complete.

To generate long videos, the system processes data in chunks. While the first chunk relies entirely on the static reference, subsequent chunks use boundary data from preceding segments. This allows the avatar to continue speaking naturally without resetting its posture or identity from scratch.

The streaming framework and pipeline loop

To accommodate chunk-based execution, the inference framework uses a modular, three-tier architecture that operates on localised windows of time, releasing resources immediately after a chunk is processed.

Module: A wrapper around a specific model and its checkpoint (e.g., A2V DiT, Super-Resolution DiT, VAE components, text/audio encoders).
Stage: A typed execution unit that coordinates one or more modules (e.g., context generation, super-resolution).
Pipeline: The execution graph that connects stages, manages shared state, and coordinates streaming or batch execution modes.
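
As a rough illustration of how these three tiers relate, here is a minimal Python sketch. The class names mirror the terms above, but the fields, method names, and the toy lambda "models" are hypothetical stand-ins rather than the framework's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Module:
    """Wraps one model and its checkpoint (hypothetical interface)."""
    name: str
    fn: Callable[[Any], Any]  # stand-in for the model's forward pass

    def __call__(self, x: Any) -> Any:
        return self.fn(x)


@dataclass
class Stage:
    """Coordinates one or more modules, applied in order."""
    name: str
    modules: List[Module]

    def run(self, x: Any) -> Any:
        for module in self.modules:
            x = module(x)
        return x


@dataclass
class Pipeline:
    """Connects stages and holds state shared across chunks."""
    stages: List[Stage]
    state: Dict[str, Any] = field(default_factory=dict)

    def run_chunk(self, chunk: Any) -> Any:
        for stage in self.stages:
            chunk = stage.run(chunk)
        self.state["chunks_done"] = self.state.get("chunks_done", 0) + 1
        return chunk


# Toy usage: list-tagging functions stand in for real models.
a2v = Stage("audio_to_video", [Module("a2v_dit", lambda x: x + ["low_res_latents"])])
sr = Stage("super_resolution", [Module("sr_dit", lambda x: x + ["high_res_latents"])])
pipeline = Pipeline([a2v, sr])
result = pipeline.run_chunk(["audio_chunk_0"])
```

In this sketch the pipeline is the only tier that carries state between chunks, while modules and stages stay stateless.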

The initialisation phase encodes the reference identity into latents once per request. The pipeline then executes a continuous loop across the remaining stages until the input audio stream is exhausted:

Context generation: Converts incoming audio segments into features, combines them with text or scene conditioning, and prepares the target noise tensors.
Audio-to-Video: Runs a multi-step diffusion pass to generate low-resolution latents. This stage conditions the current chunk on the boundary frames of the previous chunk to maintain motion continuity.
Super-Resolution: Upscales the motion latents to full resolution in a single step, prioritising spatial detail on the face.
VAE Decode-and-Publish: Decodes the high-resolution latents into RGB frames and writes them directly to the output encoder (H.264 / AAC) for immediate storage or live playback.
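
The loop above can be sketched as a Python generator that yields each chunk's output as soon as it is decoded. All function names and the string placeholders are hypothetical; real stages would operate on tensors, and the boundary data would be actual frames or latents rather than a label.

```python
from typing import Iterable, Iterator, Optional


def context_generation(audio_chunk: str) -> str:
    return f"ctx({audio_chunk})"


def audio_to_video(context: str, prev_boundary: Optional[str]) -> str:
    # The first chunk has no previous boundary and relies on the reference only.
    return f"a2v({context}, prev={prev_boundary})"


def super_resolution(latents: str) -> str:
    return f"sr({latents})"


def vae_decode(latents: str) -> str:
    return f"rgb({latents})"


def stream_frames(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Yield decoded output per chunk until the audio is exhausted."""
    prev_boundary: Optional[str] = None
    for index, audio_chunk in enumerate(audio_chunks):
        context = context_generation(audio_chunk)
        low_res = audio_to_video(context, prev_boundary)
        high_res = super_resolution(low_res)
        yield vae_decode(high_res)  # published before later chunks start
        prev_boundary = f"boundary_{index}"  # carried into the next chunk


frames = list(stream_frames(["a0", "a1"]))
```

Because the function is a generator, a consumer can start encoding or playing the first chunk while later chunks have not yet been generated.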

Boundary continuity and chunk consistency

Generating video in distinct segments can introduce discontinuities at chunk boundaries. The framework mitigates this by using two chunk types:

N Chunks: Segments that generate the main timeline of the avatar.
I Chunks (Interpolation): Segments designed to smooth transitions between sequential N chunks.

The execution sequence is structured as follows:

N0 -> N1 -> I0 -> N2 -> I1 -> N3 -> I2 -> ...

An I chunk is generated only after the N chunks on both sides of it are complete; for example, I0 runs once N0 and N1 are done. It uses the final frame of the earlier N chunk and an early frame of the later N chunk as anchor frames to compute the transitional motion. After generation, the redundant anchor predictions are discarded, leaving only the interpolated transition. This mechanism bounds the required context window while preserving temporal consistency.
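
The ordering rule can be written down in a few lines of Python. This is a small sketch of the schedule only; the function name is hypothetical and nothing here models the generation itself.

```python
from typing import List


def execution_order(num_n_chunks: int) -> List[str]:
    """Return the N/I schedule: I(k-1) runs right after N(k) completes."""
    order: List[str] = []
    for k in range(num_n_chunks):
        order.append(f"N{k}")
        if k >= 1:
            order.append(f"I{k - 1}")  # bridges N(k-1) and N(k)
    return order


schedule = execution_order(4)
print(" -> ".join(schedule))
```

For four N chunks this reproduces the sequence shown above, with each I chunk scheduled immediately after the second of the two N chunks it connects.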

Constant memory over duration

A conventional video pipeline accumulates latents, decoded frames, and attention context during execution, causing GPU memory use to scale linearly with video duration.

To enable open-ended generation, this framework maintains a strict rolling state. The system retains only the static reference conditioning and a minimal set of anchor tensors required for chunk transitions. All intermediate assets—including audio features, noise tensors, internal activations, and raw RGB frames—are purged from memory immediately after a chunk is decoded and written.

As a result, the peak GPU memory profile remains constant whether generating a short clip or an extended sequence; resource utilisation scales with the defined chunk size rather than the total duration of the session.
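
To make the difference concrete, the toy model below counts how many chunk-sized items are held at peak under the two strategies. It is a simplification under stated assumptions: every retained item is treated as the same size, the reference conditioning is ignored because it is constant in both cases, and the `anchors` parameter and function name are hypothetical.

```python
from collections import deque


def peak_retained(num_chunks: int, rolling: bool, anchors: int = 1) -> int:
    """Peak number of chunk-sized items held while processing num_chunks."""
    # With maxlen set, the deque drops its oldest entry automatically.
    retained = deque(maxlen=anchors if rolling else None)
    peak = 0
    for chunk_id in range(num_chunks):
        working = 1  # the chunk currently being generated and decoded
        peak = max(peak, len(retained) + working)
        retained.append(chunk_id)  # kept as an anchor (rolling) or forever (naive)
    return peak


naive_short = peak_retained(10, rolling=False)
naive_long = peak_retained(1000, rolling=False)
rolling_short = peak_retained(10, rolling=True)
rolling_long = peak_retained(1000, rolling=True)
```

In this model the naive peak grows with the number of chunks, while the rolling peak depends only on the anchor count.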

Loading/offloading stages within the pipeline

Each request runs across an 8-GPU node. We use Fully Sharded Data Parallel (FSDP) to shard large model parameters across GPUs. Each rank owns only a fraction of the weights, gathers the parameters it needs for a computation, and then frees them again. This is what lets multiple large models (the base DiT, the super-resolution DiT, the text encoder, the audio encoder, and the VAE) fit on one node.

There is a trade-off. FSDP introduces communication overhead during inference because parameters need to be gathered during forward passes. We use a combination of techniques to hide that overhead and to keep co-located models off the GPU when they are not in use:

Forward prefetching. The AllGather of the next block's parameters is issued ahead of time and overlapped with the current block's computation, hiding the gather latency on the critical path.
Lazy per-block unsharding from CPU. When a model is brought back from pinned CPU memory, we don’t stage the full set of weights up front. Each transformer block is unsharded (host-to-device copy + AllGather) just before its forward pass, so the H2D transfer of block n+1 overlaps with the compute of block n.
Pinned CPU offload between stages. The parameters of a model that isn’t currently running are kept in pinned CPU memory, so co-located models (base DiT, super-resolution DiT, text encoder, audio encoder, VAE) don’t all need to hold their weights on the GPU at the same time. Pinned memory is what makes the H2D copies fast enough to overlap with compute.
NUMA-aware process placement. Each process is pinned to the same NUMA node as its assigned GPU, so CPU↔GPU transfers run at full PCIe/NVLink bandwidth without crossing the inter-socket interconnect.
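
A simple timing model shows why overlapping the fetch of block n+1 with the compute of block n helps. This is a back-of-the-envelope sketch, not a measurement: the per-block millisecond values are made-up illustrative numbers, the function names are hypothetical, and the model assumes a one-block lookahead in which the next fetch starts at the same moment the current block's compute starts.

```python
from typing import Sequence


def sequential_time(fetch_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Each block waits for its own weights before computing."""
    return sum(fetch_ms) + sum(compute_ms)


def prefetched_time(fetch_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Block n+1's weights are fetched while block n computes."""
    total = fetch_ms[0]  # the first fetch cannot be hidden behind compute
    for n, compute in enumerate(compute_ms):
        next_fetch = fetch_ms[n + 1] if n + 1 < len(fetch_ms) else 0.0
        # The next block can start once both this compute and its fetch finish.
        total += max(compute, next_fetch)
    return total


# Illustrative numbers only: four blocks, 2 ms to fetch and 5 ms to compute each.
fetch = [2.0, 2.0, 2.0, 2.0]
compute = [5.0, 5.0, 5.0, 5.0]
without_overlap = sequential_time(fetch, compute)
with_overlap = prefetched_time(fetch, compute)
```

In this model, as long as each fetch is shorter than the compute it overlaps with, only the first fetch remains on the critical path.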

Sub-10ms model switching between stages

The practical payoff of the techniques above is that handing the GPU from one stage's model to the next — for example, A2V DiT → Super-Resolution DiT, or SR DiT → VAE decoder — is effectively free. Because the outgoing model is offloaded asynchronously and the incoming model's first block is unsharded just in time, the H2D copy and AllGather are both hidden behind compute that is already running. End to end, the observable per-switch overhead is under 10ms — well below a single-frame budget at our target frame rates. Concretely, this is what lets the streaming pipeline loop (Context Gen → A2V → SR → VAE Decode-and-Publish) cycle through several large models per chunk without the model swap itself ever becoming the bottleneck.
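
For a sense of scale, the per-frame budget at a given frame rate is 1000 ms divided by the frame rate. The frame rates in the sketch below are illustrative assumptions, since the specific targets are not listed here.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given playback rate, in milliseconds."""
    return 1000.0 / fps


SWITCH_OVERHEAD_MS = 10.0  # upper bound quoted above

# Illustrative frame rates only; the actual targets are not stated here.
budgets = {fps: frame_budget_ms(fps) for fps in (24, 25, 30)}
fits = all(SWITCH_OVERHEAD_MS < budget for budget in budgets.values())
```

At each of these assumed rates the frame budget is above 33 ms, so a switch of under 10 ms fits inside a single frame interval.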

Real-time streaming publishing

Streaming in realtime also depends on a number of inference optimisations to the model itself, which are covered in more detail at https://www.heygen.com/research/avatar-v-inference.

Once the pipeline emits video chunk by chunk in real time, streaming delivery becomes a natural extension of inference instead of a separate post-processing step.

For the broadcast-style realtime path, we publish generated frames to Amazon Kinesis Video Streams (KVS). KVS is usually discussed in the context of cameras, IoT devices, and uploaded media. In our case, the "camera" is the inference pipeline itself: frames are created by the model, encoded immediately, and pushed into KVS as a live stream.

The output writer receives decoded RGB frames from the streaming VAE and sends them into a GStreamer pipeline. Video is encoded as H.264 and audio as AAC, then both tracks are pushed into kvssink, the KVS producer sink. From there, viewers can play the session back as a live stream while it is still being generated.
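
A pipeline of this shape could be described roughly as follows. This is a hedged sketch, not our production configuration: it only builds a gst-launch style description string and has not been run against a live stream. The choice of `x264enc` and `avenc_aac` as encoders, the raw caps, the element names `video_src`, `audio_src` and `sink`, and the function name are all assumptions; `stream-name` and `aws-region` are properties of `kvssink`, and credentials are deliberately left out.

```python
def kvs_pipeline_description(
    stream_name: str, region: str, width: int, height: int, fps: int
) -> str:
    """Build a pipeline description: raw RGB and PCM in, H.264 + AAC to kvssink."""
    video = (
        "appsrc name=video_src is-live=true format=time "
        f"caps=video/x-raw,format=RGB,width={width},height={height},framerate={fps}/1 "
        "! videoconvert ! x264enc tune=zerolatency "
        "! h264parse ! video/x-h264,stream-format=avc,alignment=au "
        f"! kvssink name=sink stream-name={stream_name} aws-region={region}"
    )
    audio = (
        "appsrc name=audio_src is-live=true format=time "
        "caps=audio/x-raw,format=S16LE,layout=interleaved,rate=48000,channels=1 "
        "! audioconvert ! avenc_aac ! aacparse ! queue ! sink."
    )
    # Both branches end in the same kvssink element, referenced by name.
    return f"{video} {audio}"


description = kvs_pipeline_description("demo-stream", "us-west-2", 1280, 720, 25)
```

In a real writer, a string like this would typically be handed to GStreamer's launch-syntax parser, with decoded frames and audio samples pushed into the two `appsrc` elements.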

Results and lessons learned

The framework shifted Avatar IV and Avatar V generation from fixed-scene rendering to open-ended streaming generation, removing scene-duration limits for both model families. For realtime 720p Avatar IV generation, we achieved a time to first frame of under 5 seconds and a generation rate of more than 27 frames per second, which is faster than realtime playback.