HeyGen

Avatar video only feels believable if the person stays consistent over time. When the face drifts, the teeth change, the lip sync slips, or the motion resets between clips, people notice immediately. This matters more for avatars than for many other video generation tasks because the viewer is watching a specific person speak, often at close range, for a long time.

Duration is still one of the most visible limitations in video generation today. Many models and products expose generation as a fixed-length clip of a few seconds, and few systems can generate more than a few minutes. For avatar products, that limit shows up directly in customer workflows. Customers want longer, consistent scenes for training videos, sales demos, product walkthroughs, education, support, and agents that should keep talking until the task is done. They also want fast previews so they can iterate on prompts, motion, and script.

At HeyGen, that translated into three concrete requirements:

Long-scene consistency. The avatar needs to preserve identity, lip sync, expression, and motion continuity not just for one short clip, but across many chunks of generated video.
No fixed duration cap. A generation might be ten seconds, ten minutes, or an open-ended realtime session.
Fast preview, with realtime or faster-than-realtime generation. The system should start producing frames quickly and stream them out while inference is still running.

This post walks through the inference framework we built to meet those requirements.

The Underlying Model Architecture

The framework is built around HeyGen's avatar video generation models — the Avatar IV and Avatar V families. At a high level, the model takes a reference image/video, driving audio, and optional text or scene conditioning, then generates a video of that avatar speaking with the right identity, expression, and motion.

The core generation model is a Diffusion Transformer, or DiT, trained with flow matching. Instead of compressing the person into a small identity embedding, the model conditions on rich reference tokens so it can preserve details that matter for avatars: face shape, teeth, skin texture, mouth movement, gesture style, and speaking rhythm.

The production inference path has three main stages:

Audio-to-video generation. A base DiT generates low-resolution video latents from the reference identity, audio features, and conditioning signals. This stage focuses on motion, lip sync, and temporal coherence.
Identity-aware super-resolution. A second model refines those latents into high-resolution output, with extra attention on regions where people are most sensitive to artifacts, especially the face and mouth.
Streaming VAE decode. A VAE decoder converts high-resolution latents into RGB frames chunk by chunk, so frames can be emitted before the full video is complete.

To generate long videos, the system processes data in chunks. While the first chunk relies entirely on the static reference, subsequent chunks use boundary data from preceding segments. This allows the avatar to continue speaking naturally without resetting its posture or identity from scratch.

The Streaming Framework and Pipeline Loop

To accommodate chunk-based execution, the inference framework uses a modular, three-tier architecture that operates on localized windows of time, releasing resources immediately after a chunk is processed.

Module: A wrapper around a specific model and its checkpoint (e.g., A2V DiT, Super-Resolution DiT, VAE components, text/audio encoders).
Stage: A typed execution unit that coordinates one or more modules (e.g., context generation, super-resolution).
Pipeline: The execution graph that wires stages together, manages shared state, and coordinates streaming or batch execution modes.
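The three tiers above can be pictured with a minimal Python sketch. All class names, fields, and the string "tensors" here are hypothetical stand-ins chosen for illustration; the post does not show the framework's real interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Module:
    """Wraps one model and its checkpoint (hypothetical interface)."""
    name: str
    forward: Callable[[Any], Any]  # stand-in for the real model call


@dataclass
class Stage:
    """Coordinates one or more modules; here they are applied in order."""
    name: str
    modules: List[Module]

    def run(self, x: Any, state: Dict[str, Any]) -> Any:
        for module in self.modules:
            x = module.forward(x)
        state[self.name] = "done"  # shared state is visible to later stages
        return x


@dataclass
class Pipeline:
    """Wires stages together and owns the shared state."""
    stages: List[Stage]
    state: Dict[str, Any] = field(default_factory=dict)

    def run_chunk(self, chunk: Any) -> Any:
        for stage in self.stages:
            chunk = stage.run(chunk, self.state)
        return chunk


# Toy modules: strings stand in for tensors.
a2v = Stage("a2v", [Module("a2v_dit", lambda x: f"lowres({x})")])
sr = Stage("sr", [Module("sr_dit", lambda x: f"highres({x})")])
decode = Stage("decode", [Module("vae_decoder", lambda x: f"rgb({x})")])
pipeline = Pipeline([a2v, sr, decode])
result = pipeline.run_chunk("audio0")
```

Keeping the model wrapper, the execution unit, and the graph as separate types is what would let the same stages be reused in both streaming and batch modes.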

The initialization phase encodes the reference identity into latents once per request. The pipeline then executes a continuous loop across the remaining stages until the input audio stream is exhausted:

Context Generation: Converts incoming audio segments into features, combines them with text or scene conditioning, and prepares the target noise tensors.
Audio-to-Video: Executes a multi-step diffusion pass to produce low-resolution latents. This stage conditions the current chunk on the boundary frames of the previous chunk to maintain motion continuity.
Super-Resolution: Upscales the motion latents to full resolution in a single step, prioritizing spatial detail on the face.
VAE Decode-and-Publish: Decodes the high-resolution latents into RGB frames and writes them directly to the output encoder (H.264 / AAC) for immediate storage or live playback.
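The loop above can be sketched as a Python generator that yields frames as soon as each chunk is decoded. This is a toy model under stated assumptions: the function names, the dictionary "frames", and FRAMES_PER_CHUNK are all hypothetical, and each stage is a stub rather than a real model call.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

FRAMES_PER_CHUNK = 3  # illustrative; the real chunk length is not stated
Boundary = Tuple[int, int]  # (chunk index, frame index) stands in for boundary latents


def context_generation(chunk_id: int, audio: str) -> Dict:
    # Stub: turn an audio segment into conditioning features.
    return {"chunk": chunk_id, "audio_features": f"feat({audio})"}


def audio_to_video(ctx: Dict, boundary: Optional[Boundary]) -> List[Dict]:
    # Stub: the first chunk has boundary=None and relies on the reference alone.
    return [
        {"chunk": ctx["chunk"], "frame": i, "conditioned_on": boundary}
        for i in range(FRAMES_PER_CHUNK)
    ]


def super_resolution(latents: List[Dict]) -> List[Dict]:
    return [dict(x, resolution="high") for x in latents]


def vae_decode(latents: List[Dict]) -> List[Dict]:
    return [dict(x, pixels="rgb") for x in latents]


def stream_frames(audio_chunks: Iterable[str]) -> Iterator[Dict]:
    """Run the four stages per chunk and publish frames as they are decoded."""
    boundary: Optional[Boundary] = None
    for chunk_id, audio in enumerate(audio_chunks):
        ctx = context_generation(chunk_id, audio)
        low_res = audio_to_video(ctx, boundary)
        # Carry only the last frame forward as the next chunk's boundary.
        boundary = (chunk_id, low_res[-1]["frame"])
        high_res = super_resolution(low_res)
        for frame in vae_decode(high_res):
            yield frame


frames = list(stream_frames(["audio0", "audio1"]))
```

Because stream_frames is a generator, a consumer can start encoding the first chunk's frames before the second audio segment has been read.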

Boundary Continuity and Chunk Consistency

Generating video in distinct segments introduces potential boundary discontinuities. The framework mitigates this by utilizing two distinct chunk classifications:

N Chunks: Segments that generate the primary timeline of the avatar.
I Chunks (Interpolation): Segments designed to smooth transitions between sequential N chunks.

The execution sequence is structured as follows:

N0 -> N1 -> I0 -> N2 -> I1 -> N3 -> I2 -> ...

An I chunk is generated only after its preceding and succeeding N chunks are completed. It uses the final frame of the previous N chunk and an early frame of the current N chunk as anchor frames to compute the transitional motion. Following generation, the redundant anchor predictions are discarded, leaving only the smoothly interpolated transition. This mechanism bounds the required context window while preserving temporal consistency.
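As a rough illustration of the ordering and of the anchor trimming, here is a small Python sketch. The function names are hypothetical, and linear interpolation over scalar "frames" is only a stand-in for the actual generative pass, which the post does not detail.

```python
from typing import Iterator, List


def chunk_schedule(num_n_chunks: int) -> Iterator[str]:
    """Yield chunk labels in execution order: I(k-1) runs right after N(k)."""
    for k in range(num_n_chunks):
        yield f"N{k}"
        if k >= 1:
            # Both neighbours N(k-1) and N(k) now exist, so I(k-1) can run.
            yield f"I{k - 1}"


def interpolate_transition(prev_last: float, next_early: float, num_inner: int) -> List[float]:
    """Toy I chunk: predict both anchors plus num_inner frames, then drop the anchors."""
    total = num_inner + 2
    predicted = [
        prev_last + (next_early - prev_last) * i / (total - 1)
        for i in range(total)
    ]
    return predicted[1:-1]  # the anchor predictions are redundant and discarded


order = list(chunk_schedule(4))
transition = interpolate_transition(0.0, 4.0, 3)
```

Here order reproduces the sequence N0, N1, I0, N2, I1, N3, I2, and transition keeps only the three in-between values.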

Constant Memory over Duration

A conventional video pipeline accumulates latents, decoded frames, and attention context during execution, causing GPU memory consumption to scale linearly with video duration.

To enable open-ended generation, this framework maintains a strict rolling state. The system retains only the static reference conditioning and a minimal set of anchor tensors required for chunk transitions. All intermediate assets—including audio features, noise tensors, internal activations, and raw RGB frames—are purged from memory immediately after a chunk is decoded and written.

As a result, peak GPU memory stays constant whether the system is generating a short video or a long sequence; resource usage scales with the configured chunk size, not with the total length of the session.
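A toy simulation makes the difference between an accumulating pipeline and a rolling state concrete. It counts hypothetical "tensors" as list entries rather than measuring real GPU memory, and the function name and parameters are assumptions made for this sketch.

```python
from typing import List


def peak_live_tensors(num_chunks: int, frames_per_chunk: int, rolling: bool) -> int:
    """Return the largest number of toy tensors alive at any one time."""
    persistent: List[str] = ["reference"]  # static reference conditioning
    anchors: List[str] = []                # what the next chunk needs
    accumulated: List[str] = []            # only grows in the conventional mode
    peak = 0
    for c in range(num_chunks):
        intermediates = [f"chunk{c}_frame{i}" for i in range(frames_per_chunk)]
        live = len(persistent) + len(anchors) + len(accumulated) + len(intermediates)
        peak = max(peak, live)
        anchors = [intermediates[-1]]  # keep a single anchor for the transition
        if not rolling:
            accumulated.extend(intermediates)  # conventional: keep everything
        # In rolling mode the intermediates are dropped at the end of the iteration.
    return peak


short_rolling = peak_live_tensors(10, 8, rolling=True)
long_rolling = peak_live_tensors(1000, 8, rolling=True)
short_accumulating = peak_live_tensors(10, 8, rolling=False)
```

In this model the rolling peak is the same for 10 and for 1000 chunks, while the accumulating peak keeps growing with the number of chunks.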

Loading and Offloading Stages Within the Pipeline

Each request runs across an 8-GPU node. We use FSDP to shard large model parameters across GPUs. Each rank owns only a fraction of the weights, gathers the parameters it needs for a computation, and then frees them again. This is what lets multiple large models — the base DiT, the super-resolution DiT, the text encoder, the audio encoder, and the VAE — fit on one node.

There is a tradeoff here. FSDP adds communication overhead at inference time, because parameters have to be gathered during the forward pass. We use a combination of techniques to hide that overhead and to keep co-located models off the GPU when they are not in use:

Forward prefetching. The AllGather of the next block's parameters is issued ahead of time and overlapped with the current block's computation, hiding the gather latency on the critical path.
Lazy per-block unsharding from CPU. When a model is brought back from pinned CPU memory, we do not stage the full set of weights up front. Each transformer block is unsharded (host-to-device copy + AllGather) just before its forward pass, so the H2D transfer of block n+1 overlaps with the compute of block n.
Pinned-CPU offload between stages. The parameters of a model that is not currently running are kept in pinned CPU memory, so co-located models (base DiT, super-resolution DiT, text encoder, audio encoder, VAE) do not all need to hold their weights on the GPU at the same time. Pinned memory is what makes the H2D copies fast enough to overlap with compute.
NUMA-aware process placement. Each process is pinned to the same NUMA node as its assigned GPU, so CPU↔GPU transfers run at full PCIe/NVLink bandwidth without crossing the inter-socket interconnect.
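To see why overlapping the next block's load with the current block's compute matters, here is a simple timeline model in Python. It is not FSDP code: the function names are hypothetical, the millisecond figures are made up, and "load" lumps the H2D copy and the AllGather into one number per block.

```python
from typing import List


def serial_time(load_ms: List[float], compute_ms: List[float]) -> float:
    """Each block is fully loaded, then computed, with no overlap."""
    return sum(load_ms) + sum(compute_ms)


def prefetched_time(load_ms: List[float], compute_ms: List[float]) -> float:
    """Block n+1 is loaded while block n computes (one block of lookahead)."""
    load_done = load_ms[0]  # the first load cannot be hidden behind compute
    compute_done = 0.0
    for n in range(len(compute_ms)):
        # Block n starts once its weights are present and block n-1 has finished.
        start = max(load_done, compute_done)
        if n + 1 < len(load_ms):
            # The prefetch of block n+1 is issued as block n starts computing.
            load_done = start + load_ms[n + 1]
        compute_done = start + compute_ms[n]
    return compute_done


loads = [2.0, 2.0, 2.0, 2.0]      # illustrative per-block load times (ms)
computes = [5.0, 5.0, 5.0, 5.0]   # illustrative per-block compute times (ms)
no_overlap = serial_time(loads, computes)
with_overlap = prefetched_time(loads, computes)
```

With these made-up numbers only the first load stays on the critical path. If loads were slower than compute, the model shows the pipeline becoming load-bound instead, which is why fast pinned-memory copies matter.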

Sub-10 ms Model Swapping Between Stages

The practical payoff of the techniques above is that moving the GPU from one stage's model to the next stage's model (for example, A2V DiT → Super-Resolution DiT, or SR DiT → VAE decoder) costs almost nothing. Because the outgoing model is offloaded from the GPU asynchronously, and the first block of the incoming model is unsharded just in time, both the H2D copy and the AllGather are hidden behind compute that is already running. End to end, the observed overhead of each model transition is under 10 ms, well below the budget of a single frame at the frame rates we target. In practice, this is what lets the streaming pipeline loop (Context Gen → A2V → SR → VAE Decode-and-Publish) pass through several large models on every chunk without model swapping itself becoming a bottleneck.

Realtime Streaming Publish

To make the model fast enough for realtime streaming, we applied many inference optimizations; see https://www.heygen.com/research/avatar-v-inference for more details on that part.

Once the pipeline produces video chunk by chunk in real time, streaming becomes a natural extension of inference rather than a separate post-processing step.

In the broadcast-style realtime path, we publish the generated frames to Amazon Kinesis Video Streams (KVS). KVS is usually discussed in the context of cameras, IoT devices, and media uploaded to a server. In our case, the "camera" is the inference pipeline itself: frames are generated by the model, encoded immediately, and sent to KVS as a live stream.

The output writer receives decoded RGB frames from the streaming VAE and pushes them into a GStreamer pipeline. Video is encoded as H.264 and audio as AAC, and both tracks are then sent to kvssink, the KVS producer sink. From there, viewers can play the session as a live stream while it is still being generated.
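One plausible shape for such a pipeline is sketched below as a Python helper that only builds a GStreamer launch description. The element choices (appsrc, x264enc, avenc_aac), the encoder settings, the stream name, region, and frame rate are all assumptions for illustration; the post names only GStreamer, H.264, AAC, and kvssink. Credential configuration is omitted.

```python
def build_kvs_pipeline(stream_name: str, region: str, width: int, height: int,
                       fps: int, audio_rate: int = 48000) -> str:
    """Build a launch description with one video and one audio branch into kvssink."""
    video_caps = f"video/x-raw,format=RGB,width={width},height={height},framerate={fps}/1"
    audio_caps = f"audio/x-raw,format=S16LE,layout=interleaved,rate={audio_rate},channels=1"
    video_branch = " ! ".join([
        # appsrc is where the application would push decoded RGB frames.
        f"appsrc name=video_src is-live=true format=time caps={video_caps}",
        "videoconvert",
        # Low-latency settings; one keyframe per second is an illustrative choice.
        f"x264enc tune=zerolatency bframes=0 key-int-max={fps}",
        "h264parse",
        "video/x-h264,stream-format=avc,alignment=au",
        f"kvssink name=sink stream-name={stream_name} aws-region={region}",
    ])
    audio_branch = " ! ".join([
        f"appsrc name=audio_src is-live=true format=time caps={audio_caps}",
        "audioconvert",
        "avenc_aac",
        "aacparse",
        "queue",
        "sink.",  # link the audio branch to the kvssink element named "sink"
    ])
    return f"{video_branch} {audio_branch}"


description = build_kvs_pipeline("demo-stream", "us-west-2", 1280, 720, 25)
```

A string like this could be handed to a GStreamer parser such as Gst.parse_launch, with the application then pushing timestamped buffers into the two appsrc elements. The helper itself only builds the string and does not require GStreamer to be installed.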

Results and Takeaways

The framework turned Avatar IV and Avatar V generation from fixed-scene rendering into open-ended, unbounded streaming generation. The most important result is simple: we removed scene duration limits for Avatar IV and Avatar V. For realtime Avatar IV generation, we reached a time to first frame of under 5 seconds and a generation rate of more than 27 frames per second for 720p Avatar IV videos, which is faster than realtime playback.
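A short Python calculation puts these figures in context. The 27 fps generation rate comes from the results above; the 25 fps playback rate is an assumption made only for this example, since the post does not state the output frame rate.

```python
GENERATION_FPS = 27.0  # lower bound reported above for 720p Avatar IV
PLAYBACK_FPS = 25.0    # assumption for illustration, not stated in the post


def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame at a given frame rate."""
    return 1000.0 / fps


def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """Values above 1.0 mean frames are produced faster than they are played."""
    return generation_fps / playback_fps


budget = frame_budget_ms(GENERATION_FPS)
factor = realtime_factor(GENERATION_FPS, PLAYBACK_FPS)
```

At 27 fps the per-frame budget is roughly 37 ms, and under the assumed playback rate the generator runs slightly ahead of playback, so a live viewer's buffer would grow rather than drain.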