Research

Technical reports · Apr 8, 2026

Avatar V: Scaling Video-Reference Avatar Generation

Avatar V is built on a Diffusion Transformer with flow matching that conditions directly on the full token sequence of a user’s reference video—no bottleneck embeddings. Sparse Reference Attention keeps cost almost linear with reference length. A five-stage training curriculum progresses from general video pre-training through identity-preserving fine-tuning, distillation, and RLHF alignment.
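The report does not spell out how Sparse Reference Attention selects tokens, but the "almost linear" cost suggests each query attends to a small, fixed-size subset of reference tokens rather than all of them. The sketch below is a minimal NumPy illustration of that idea under an assumed top-k selection rule; the function name, shapes, and `top_k` parameter are illustrative, not Avatar V's actual implementation.

```python
import numpy as np

def sparse_reference_attention(q, ref_k, ref_v, top_k=4):
    """Each query attends only to its top_k highest-scoring reference
    tokens, so per-query cost is O(top_k) after scoring instead of
    mixing over the full reference sequence.
    Shapes: q (Tq, d), ref_k and ref_v (Tr, d)."""
    scores = q @ ref_k.T / np.sqrt(q.shape[-1])        # (Tq, Tr)
    # keep the top_k reference tokens per query, drop the rest
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    kept = np.take_along_axis(scores, idx, axis=-1)    # (Tq, top_k)
    w = np.exp(kept - kept.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept tokens
    return np.einsum('qk,qkd->qd', w, ref_v[idx])      # gather values, mix
```

With `top_k` equal to the full reference length this reduces to dense attention; shrinking `top_k` trades a little fidelity for cost that no longer grows with how many reference tokens each query actually mixes.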

Technical reports · Apr 3, 2026

Curating Millions of Videos: The Data Engine Behind Avatar V

A distributed data engine orchestrating 25+ processing stages and 20+ specialized AI models transforms 50M raw videos into 100M+ pretraining clips and 10M+ avatar fine-tuning clips. A 10-stage segment-level curation cascade, 13 parallel feature extraction stages, 10 fine-grained avatar quality signals, and a cross-clip identity connectivity graph produce the training data that makes Avatar V possible.
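The core pattern behind a curation cascade is ordered filtering: cheap predicates prune the pool before expensive model-based stages run, and per-stage pass counts show where data is lost. A minimal sketch of that pattern, with hypothetical stage names and clip fields (the real engine's 25+ stages and models are not reproduced here):

```python
def run_cascade(clips, stages):
    """Apply each (name, predicate) stage in order; a clip must pass
    every stage to survive. Returns survivors plus per-stage
    (before, after) counts for funnel diagnostics."""
    stats = {}
    for name, keep in stages:
        before = len(clips)
        clips = [c for c in clips if keep(c)]
        stats[name] = (before, len(clips))
    return clips, stats

# Toy stages, ordered cheap-to-expensive; real stages would wrap
# specialized models (face detection, quality scoring, etc.).
stages = [
    ("duration",   lambda c: c["dur"] >= 2),
    ("resolution", lambda c: c["res"] >= 720),
]
```

Ordering matters: putting the cheapest, highest-rejection filters first minimizes how many clips ever reach the GPU-heavy stages.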

Technical reports · Apr 2, 2026

From Model to Production: Optimizing Avatar V Inference at Scale

Avatar V generates 1080p video at 30 fps across 8 GPUs per request. A custom compiler with LLM-based agentic kernel synthesis achieves 3× latency reduction over the unoptimized baseline and 33% improvement over torch.compile. Chunk-based autoregressive generation enables arbitrary-length output, while NVSHMEM-based sequence parallelism, two-level context caching, and streaming VAE decode keep memory bounded and throughput high.
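Chunk-based autoregressive generation with bounded memory comes down to conditioning each chunk on a fixed-size window of previously generated chunks. The sketch below shows only that control flow, with a toy `step` callback standing in for the model; the cache size, chunk shape, and function names are assumptions, not Avatar V's actual inference stack.

```python
from collections import deque

def generate_stream(num_chunks, chunk_len, step, cache_chunks=2):
    """Yield chunks one at a time. Each chunk is conditioned only on
    the last `cache_chunks` chunks, so memory stays bounded no matter
    how long the total output grows."""
    cache = deque(maxlen=cache_chunks)   # old chunks evicted automatically
    for i in range(num_chunks):
        context = [f for chunk in cache for f in chunk]
        chunk = [step(context, i * chunk_len + t) for t in range(chunk_len)]
        cache.append(chunk)
        yield chunk                      # stream out, e.g. into a VAE decoder
```

Because chunks are yielded as they finish, a downstream streaming decoder can start producing frames before the full sequence exists, which is what makes arbitrary-length output practical.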

Technical reports · Apr 1, 2026

HELIOS: Unified GPU Infrastructure for Training, Inference, and Data at Scale

HELIOS is a unified GPU infrastructure platform managing 5,000+ GPUs across 5+ cloud providers and 15+ standardized cells. A two-stage QoS-aware scheduler improved GPU utilization by 15% and reduced non-productive GPU time by 20%. A custom declarative data processing engine replaced Ray, scaling to 200K+ concurrent tasks with 95%+ GPU utilization and node failure detection under 30 seconds.
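A two-stage scheduler typically separates *ordering* (which job goes next, driven by QoS class) from *placement* (which node it lands on, driven by packing efficiency). The following is a minimal sketch of that split under assumed QoS classes and a best-fit placement rule; HELIOS's actual policies are not described at this level of detail in the summary.

```python
def schedule(jobs, free_gpus, qos_order=("prod", "batch", "best_effort")):
    """Stage 1: rank jobs by QoS class, then by size (largest first).
    Stage 2: place each job on the feasible node with the fewest free
    GPUs remaining (best-fit), which reduces fragmentation.
    `free_gpus` maps node name -> available GPU count (mutated)."""
    placement = {}
    ranked = sorted(jobs, key=lambda j: (qos_order.index(j["qos"]), -j["gpus"]))
    for job in ranked:
        fits = [n for n in free_gpus if free_gpus[n] >= job["gpus"]]
        if not fits:
            continue                      # job stays queued for the next cycle
        node = min(fits, key=lambda n: free_gpus[n])
        free_gpus[node] -= job["gpus"]
        placement[job["name"]] = node
    return placement
```

Best-fit placement keeps large contiguous blocks of GPUs free on lightly loaded nodes, so big training jobs are less likely to starve behind many small ones.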

Technical reports · Mar 3, 2026

TAVR: Generate Your Talking Avatar from Video Reference

TAVR replaces single-image avatar references with short video clips, enabling cross-scene generation with significantly better identity preservation. A three-stage training strategy bridges the domain gap between reference and target scenes. On a new cross-scene benchmark, TAVR yields the best identity similarity and achieves an overall quality score of 16.42 vs 14.13 for the next best method.
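Identity similarity on a cross-scene benchmark is commonly computed as mean cosine similarity between face embeddings of generated frames and reference frames. The benchmark's exact metric isn't specified in the summary, so the snippet below is only a generic sketch of that standard measure, with hypothetical inputs.

```python
import numpy as np

def identity_similarity(gen_embs, ref_embs):
    """Mean pairwise cosine similarity between generated-frame and
    reference-frame face embeddings (rows = frames). Higher means
    better identity preservation."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    return float(np.mean(unit(gen_embs) @ unit(ref_embs).T))
```

In practice the embeddings would come from a pretrained face-recognition model applied to detected face crops; this function only aggregates them.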