HTML to Video: How HyperFrames Solved AI Video Rendering

The Problem

Creating video is hard, but LLMs are great at writing code, specifically HTML. What if we could have LLMs write code to create a video?

The pitch was clean (and exciting!).

The reality was not…

About a year ago we started our journey. Huge prompts, lots of back-and-forth, and plenty of hand-written glue to cover what the model missed. We added an agent to hand-hold the model. It helped, but it still wasn’t production-ready.

Next up we tried Remotion. It’s the right shape for deterministic video (React components, great tooling, production-ready rendering), but the React framework kept boxing the agent in.

Outputs got safer and more repetitive the more guardrails we added. When we dropped back to plain HTML/CSS/JS, the creativity came back.

That raised the real question:

“Can we keep the freedom of HTML and still render a deterministic MP4?”

HyperFrames is our answer:

a minimal authoring model (HTML + data-* clip attributes)
a pluggable animation runtime
a render pipeline that forces headless Chrome to produce the same pixels every run

Browsers don’t want to do this. Rendering is threaded and asynchronous: images decode in the background, videos drop frames under load, animations follow the display clock. All of that buys performance, and all of it costs determinism.

Long story short, we set out to solve it and we did!

However, we also wanted to ensure it worked well and grew with the models. One tell that a framework is fighting an agent is if you need the biggest model just to get working output.

We started by shaping the simplest version of HyperFrames around what Gemini Flash could reliably author.

From there, we ran evals across different models, tightened the skills and the runtime wherever they failed, and repeated.

The goal was not to optimize for one model.

It was to make the authoring model simple enough, and the agent scaffolding strong enough, that a wide range of models could produce usable compositions.

We build AI models and agents for a living. The feedback loop that tightens an agent's output is the same loop that tightens HyperFrames.

How We Got Here...

The One Trick: Seek, Don't Play

Every composition in HyperFrames exposes exactly one thing to the runtime:

A JavaScript code snippet defining an object with duration and seek properties.
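
As a rough sketch of that contract (not the actual HyperFrames source), the object might look like the following. Only `duration` and `seek` come from the description above; `createComposition`, the `timeline` stub, and its `renderAt` method are hypothetical names used for illustration.

```javascript
// Hypothetical sketch: a composition exposes only `duration` and `seek`.
function createComposition(timeline) {
  return {
    // Total length of the composition in seconds.
    duration: timeline.duration,
    // Jump to an absolute time, clamped to [0, duration]. Never starts playback.
    seek(t) {
      const clamped = Math.min(Math.max(t, 0), timeline.duration);
      timeline.renderAt(clamped);
      return clamped;
    },
  };
}

// Stand-in timeline: moves a box 100px per second for 10 seconds.
const state = { x: 0 };
const timeline = {
  duration: 10,
  renderAt(t) { state.x = t * 100; },
};

// In a browser this object would be assigned to window.__hf.
const hf = createComposition(timeline);
hf.seek(2.5);
console.log(state.x); // 250
```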

The renderer never calls play(). It calls seek(0), screenshots, seek(1/30), screenshots, seek(2/30), screenshots, until it has 300 frames for a 10-second 30fps video.

Time doesn't advance on its own. Nothing is driven by requestAnimationFrame. The browser's job is to hold a fixed frame until the next one is requested.
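
The frame loop described above can be sketched as follows. This is a minimal illustration, assuming a composition object with `duration` and `seek` and a caller-supplied `capture` function; `renderAllFrames` is a hypothetical name. Each time is derived from the frame index, never from a clock.

```javascript
// Sketch: drive a seek-only composition frame by frame.
// `composition` and `capture` are stand-ins supplied by the caller.
async function renderAllFrames(composition, fps, capture) {
  const totalFrames = Math.round(composition.duration * fps);
  for (let i = 0; i < totalFrames; i++) {
    const t = i / fps; // time comes from the frame index, not from a clock
    composition.seek(t);
    await capture(i, t);
  }
  return totalFrames;
}

// Stand-in composition that only records the times it was asked to show.
const seeks = [];
const demo = { duration: 10, seek: (t) => seeks.push(t) };
renderAllFrames(demo, 30, async () => {}).then((n) => console.log(n)); // 300
```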

This one abstraction is what collapses two very different systems into one codebase.

The studio preview runs the same window.__hf inside an iframe with a postMessage bridge for play/pause/scrub.

The headless render runs the same window.__hf via Puppeteer and CDP. When a user scrubs the timeline, the preview iframe calls seek. When the renderer captures frame 147, it calls seek(147 / fps). Same code path. Same output.

Animation libraries plug in through a three-method FrameAdapter:

JavaScript code defining the `FrameAdapter` interface.

GSAP is the default because its timelines are already paused-and-seekable by design. timeline.pause() followed by timeline.totalTime(t, false) is the exact functionality we need. Lottie, CSS via WAAPI, and Three.js clocks all fit the same shape easily.
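
A GSAP-backed adapter could look roughly like this. The three method names (`getDuration`, `seek`, `dispose`) are assumptions, since the real FrameAdapter interface is not reproduced here. The timeline calls (`pause`, `totalTime`, `totalDuration`, `kill`) are standard GSAP timeline methods, stubbed below so the sketch runs without GSAP installed.

```javascript
// Hypothetical adapter shape; the real FrameAdapter method names may differ.
function createGsapAdapter(timeline) {
  timeline.pause(); // take the clock away from the library
  return {
    getDuration: () => timeline.totalDuration(),
    // totalTime(t, false) moves the playhead without suppressing callbacks.
    seek: (t) => { timeline.totalTime(t, false); },
    dispose: () => { timeline.kill(); },
  };
}

// Stub with the same method names as a GSAP timeline, for illustration only.
const fakeTimeline = {
  paused: false,
  time: 0,
  killed: false,
  pause() { this.paused = true; return this; },
  totalTime(t) { this.time = t; return this; },
  totalDuration() { return 4; },
  kill() { this.killed = true; },
};

const adapter = createGsapAdapter(fakeTimeline);
adapter.seek(1.5);
console.log(fakeTimeline.paused, fakeTimeline.time); // true 1.5
```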

What doesn't fit is anything that insists on owning the clock: CSS keyframe animations without a controller, video elements, most canvas libraries running their own requestAnimationFrame.

For those you either wrap them in an adapter that takes the clock away, or you render them to frames offline and replay them as images. This is similar to how we handle videos, which we will explain later.

Capture: Controlling Chrome Frame by Frame

The first version of the capture loop was four lines of Puppeteer:

Javascript code snippet showing `page.evaluate` to seek a position and `page.screenshot` to save an image frame.
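
A naive loop of that kind might look like the sketch below. It is a reconstruction under assumptions, not the original code: `page` is a Puppeteer Page (or anything with the same `evaluate` and `screenshot` methods), and the function name and file naming are illustrative.

```javascript
// Sketch of a naive seek-then-screenshot loop.
// `page` is assumed to be a Puppeteer Page; file names are illustrative.
async function captureNaive(page, fps, totalFrames, outDir) {
  for (let i = 0; i < totalFrames; i++) {
    // Runs inside the page: move the composition to frame i.
    await page.evaluate((t) => window.__hf.seek(t), i / fps);
    // Asks Chrome for whatever the compositor currently has.
    await page.screenshot({ path: `${outDir}/frame-${i}.png` });
  }
}
```

Nothing in this loop waits for layout, fonts, or paint to settle, which is where the problems below come from.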

Here are the four things that we ran into.

1. Page.captureScreenshot races the renderer

The call returns an image as soon as the compositor is willing to hand one over. That is not the same moment as "layout is done, fonts are loaded, the GSAP tween has committed its final style, and the GPU has finished painting."

You get frames where text hasn't rendered yet, where an SVG fill is still the unanimated default, where a video element is showing its 300x150 default size because metadata hasn't loaded. Every one of these is a frame that renders fine the second time and wrong the first.

We spent longer than we'd like to admit writing "did the frame land" heuristics: poll for fonts.ready, wait for computed styles, compare pixel hashes.

The heuristics work well, although they are not the most robust. This is still the path we use on macOS and Windows, where the fully deterministic alternative isn't available. It's not how you want to run production at scale. But more on that next.
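
One of those heuristics, comparing pixel hashes, can be sketched as "re-capture until two consecutive screenshots are identical". This is an illustration of the idea, not the shipped heuristic: `captureWhenStable`, the attempt limit, and the FNV-1a hash are all assumptions chosen to keep the example dependency-free.

```javascript
// FNV-1a over the bytes: a cheap stand-in for a real pixel hash.
function hashBytes(bytes) {
  let h = 0x811c9dc5;
  for (const b of bytes) {
    h ^= b;
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Re-capture until two consecutive screenshots hash the same, or give up.
async function captureWhenStable(takeScreenshot, maxAttempts = 5) {
  let previous = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const bytes = await takeScreenshot();
    const current = hashBytes(bytes);
    if (current === previous) return { bytes, attempts: attempt };
    previous = current;
  }
  throw new Error(`frame did not settle after ${maxAttempts} attempts`);
}
```

The cost is at least two screenshots per frame, which is one reason this approach is a poor fit for production at scale.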

2. HeadlessExperimental.beginFrame gives you that control

It's a CDP method that runs one layout→paint→composite→screenshot cycle atomically and returns the result:

JavaScript code using `cdp.send` to begin an experimental headless frame with screenshot options.

One call, one frame. The compositor is paused until you ask for the next one.

The response includes hasDamage, which tells you whether anything visually changed since the previous frame. There are no race conditions because there is no concurrent render pipeline still settling in the background.

You seek, you call beginFrame, you get the screenshot image.
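
A minimal sketch of that call, assuming `cdp` is a Puppeteer CDPSession (anything with `send(method, params)` works). The parameter and result names follow the Chrome DevTools Protocol; the `captureFrame` helper and the JPEG quality value are illustrative.

```javascript
// One beginFrame call: layout, paint, composite, and screenshot in one step.
// `cdp` is assumed to be a CDP session object with send(method, params).
async function captureFrame(cdp, frameIndex, fps) {
  const intervalMs = 1000 / fps;
  const result = await cdp.send("HeadlessExperimental.beginFrame", {
    frameTimeTicks: frameIndex * intervalMs, // timestamp of this frame, in ms
    interval: intervalMs,
    screenshot: { format: "jpeg", quality: 90 },
  });
  // screenshotData is base64 and can be missing if no screenshot was taken.
  return {
    hasDamage: result.hasDamage,
    image: result.screenshotData
      ? Buffer.from(result.screenshotData, "base64")
      : null,
  };
}
```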

Making this work requires a specific Chrome build and a specific set of flags. The binary is chrome-headless-shell, not regular Chrome. The flags are:

Bash terminal window displaying 8 numbered command-line flags, including `--deterministic-mode` and `--disable-threaded-animation`.

Every one of those flags is turning off a source of async scheduling: threaded compositor, threaded scrolling, incremental image decoding, image animation resync, vsync-based surface timing.

With them on, the compositor runs synchronously on the main thread and does not advance until CDP tells it to. With --deterministic-mode the time source is fixed too, so performance.now() is driven by the frameTimeTicks you pass in rather than by the system clock.
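
For orientation, here is one plausible flag set expressed as launch options. This list is an assumption: it is taken from Chromium's headless switches, where `--deterministic-mode` is documented as a meta flag covering the others, and it may not match the list used by HyperFrames exactly. `buildLaunchOptions` and `shellPath` are hypothetical names.

```javascript
// Assumed flag set, based on Chromium's headless switches.
// The exact list used in production is not reproduced here and may differ.
const DETERMINISTIC_FLAGS = [
  "--deterministic-mode",
  "--enable-begin-frame-control",
  "--run-all-compositor-stages-before-draw",
  "--disable-new-content-rendering-timeout",
  "--disable-threaded-animation",
  "--disable-threaded-scrolling",
  "--disable-checker-imaging",
  "--disable-image-animation-resync",
];

// Launch options for a launcher such as Puppeteer.
// `shellPath` is a placeholder for the chrome-headless-shell binary path.
function buildLaunchOptions(shellPath) {
  return { executablePath: shellPath, args: [...DETERMINISTIC_FLAGS] };
}
```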

Constraint worth stating: this combination only works on Linux with chrome-headless-shell.

On macOS and Windows we fall back to Page.captureScreenshot with the "did the frame land" heuristics, because Chrome on those platforms either crashes under --deterministic-mode or has its own issues with the flag combination.

For CI and production renders we run in Docker on Linux. Local dev works anywhere, with lower fidelity on non-Linux.

3. Chrome stops advancing its event loop

When --enable-begin-frame-control is active, Chrome's main thread stops ticking on its own.

No frame callbacks, no setTimeout, no microtask drain between tasks. Nothing runs until a beginFrame message arrives over CDP.

Which is great for determinism during capture. It is catastrophic during page load, because document.fonts.ready is a promise that resolves on a task, and tasks don't drain if nothing's ticking.

Your GSAP script loads. Your timeline registers on window.__timelines. window.__hf.seek is wired up. And then document.fonts.ready hangs forever.

The fix is a warmup loop. While the page is loading, we fire a beginFrame every 33ms with noDisplayUpdates: true, which advances the event loop without producing a frame:

JavaScript code simulating frames in a headless environment with a 33ms interval.

We kill the loop once window.__hf is ready and fonts have loaded, then start real capture at a frame time past the warmup range so the compositor never sees time going backwards. It's the kind of workaround you only know to write after the first render hangs at "loading fonts" until the timeout fires.
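
A warmup loop along those lines could be sketched like this. It assumes `cdp` is a CDP session with a promise-returning `send`; `startWarmup` and the shape of its return value are hypothetical. `stop()` returns the last warmup frame time, so real capture can start after it.

```javascript
// Tick Chrome's event loop during page load without producing frames.
// `cdp` is assumed to be a CDP session with a promise-returning send().
function startWarmup(cdp, intervalMs = 33) {
  let frameTimeTicks = 0;
  const timer = setInterval(() => {
    frameTimeTicks += intervalMs;
    cdp
      .send("HeadlessExperimental.beginFrame", {
        frameTimeTicks,
        interval: intervalMs,
        noDisplayUpdates: true, // run tasks, skip the display update
      })
      .catch(() => {}); // a failed tick is not fatal; the next one follows
  }, intervalMs);

  return {
    // Stops ticking and reports the last frame time that was sent.
    stop() {
      clearInterval(timer);
      return frameTimeTicks;
    },
  };
}
```

Starting real capture at a `frameTimeTicks` greater than the value returned by `stop()` keeps frame time monotonic.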

4. Puppeteer's waitForFunction stops working

The idiom for "wait until the page is ready" in Puppeteer is page.waitForFunction(...). Under the hood, that polls via requestAnimationFrame in the injected world.

rAF doesn't fire in beginFrame mode.

So waitForFunction hangs for the same reason fonts.ready does, except you lose the nice Puppeteer error message and get a generic timeout once the deadline fires.

The fix is to stop using waitForFunction and write the polling loop yourself with evaluate and setTimeout:

JavaScript code snippet displaying a while loop that checks for a page element's readiness with a 100ms delay.

Less clever. More obvious. Doesn't depend on anything running inside the page's frame loop.
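
A hand-rolled polling loop might look like the following sketch. The readiness condition (`window.__hf` present and `document.fonts.status` equal to `"loaded"`) and the default timings are assumptions. The point is that the waiting happens on the Node side with `setTimeout`, not inside the page.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll from Node with evaluate + setTimeout; nothing here relies on rAF in the page.
// `page` is assumed to be a Puppeteer Page (or anything with evaluate()).
async function waitUntilReady(page, { timeoutMs = 30000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const ready = await page.evaluate(
      () => Boolean(window.__hf) && document.fonts.status === "loaded"
    );
    if (ready) return;
    await sleep(intervalMs);
  }
  throw new Error(`page not ready after ${timeoutMs}ms`);
}
```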

After those four things, the capture loop that ships today is approximately this:

JavaScript code snippet with a for loop that captures video frames and saves them as JPG images.

One seek, one beginFrame, one frame on disk. No retries. No flaky frames. Deterministic.
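
Put together, the loop might be sketched as below. This is an approximation under assumptions: `captureAll`, `startTicks`, and the `saveFrame` callback are hypothetical, and reusing the previous image when Chrome returns no screenshot data is a choice made for this sketch, not something described above.

```javascript
// Sketch of a seek -> beginFrame -> save loop.
// `startTicks` should be past any warmup frame times so time never runs backwards.
async function captureAll({ page, cdp, fps, totalFrames, startTicks, saveFrame }) {
  const intervalMs = 1000 / fps;
  let previous = null;
  for (let i = 0; i < totalFrames; i++) {
    await page.evaluate((t) => window.__hf.seek(t), i / fps);
    const { screenshotData } = await cdp.send("HeadlessExperimental.beginFrame", {
      frameTimeTicks: startTicks + i * intervalMs,
      interval: intervalMs,
      screenshot: { format: "jpeg", quality: 90 },
    });
    // Assumption: if no image came back, the picture is unchanged, so reuse it.
    const bytes = screenshotData ? Buffer.from(screenshotData, "base64") : previous;
    await saveFrame(i, bytes);
    previous = bytes;
  }
}
```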

The Video-in-Video Problem

Letting a browser play <video> at render time does not work.

In headless mode with BeginFrame on, video decoders skip frames, fail to decode, or sit at readyState: 0 long enough to break the capture deadline.

Even without BeginFrame, different machines and different codec paths produce different output on the same composition. What you see is not what you get.

A <video> on a webpage is happy to drop frames and call it good. A video renderer cannot.

So we took the decoding away from Chrome.

Before capture starts, FFmpeg pre-extracts every <video> in the composition into numbered JPEGs at the target fps.

A 5-second clip at 30fps becomes 150 files.
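
The extraction step can be sketched as an FFmpeg argument list plus the expected frame count. The helper names, output naming pattern, and quality setting are assumptions; `-i`, `-vf fps=`, and `-q:v` are standard FFmpeg options.

```javascript
// Build FFmpeg arguments that dump a clip to numbered JPEGs at a fixed fps.
function buildExtractArgs(inputPath, outDir, fps) {
  return [
    "-i", inputPath,
    "-vf", `fps=${fps}`, // resample to the composition frame rate
    "-q:v", "2",         // JPEG quality scale; lower means higher quality
    `${outDir}/frame_%06d.jpg`,
  ];
}

// Nominal number of frames for a clip of the given length.
function expectedFrameCount(durationSeconds, fps) {
  return Math.round(durationSeconds * fps);
}

console.log(expectedFrameCount(5, 30)); // 150
```

The argument list would be passed to an `ffmpeg` process, for example through Node's `child_process.spawn`.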

During capture, for each active video on the current frame, we inject an <img> sibling with the right frame's bytes as a data URI and hide the original <video>:

HTML code showing a video element and a base64 encoded image of a captured video frame.

The interesting part is making the <img> look exactly like the <video> it replaced, so GSAP tweens, CSS transforms, opacity fades, and object-fit rules all keep working.

We read computed styles off the original element and copy them onto the injected image:

JavaScript code assigning an image's style properties like position, transform, opacity, and objectFit from computed styles, with a note indicating many more.
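
A minimal sketch of the swap, assuming it runs inside the page. The property list is a short illustrative subset, and `mirrorStyles`, `showFrame`, and the injectable `getStyle` parameter are hypothetical names (the parameter exists so the sketch can also run outside a browser).

```javascript
// Properties to mirror; a real list would be much longer.
const MIRRORED_PROPS = [
  "position", "top", "left", "width", "height",
  "transform", "opacity", "objectFit", "zIndex",
];

// Copy computed styles from the original <video> onto its <img> stand-in.
function mirrorStyles(video, img, getStyle = (el) => getComputedStyle(el)) {
  const computed = getStyle(video);
  for (const prop of MIRRORED_PROPS) {
    img.style[prop] = computed[prop];
  }
}

// Show one pre-extracted frame (base64 JPEG) and hide the original <video>.
function showFrame(video, img, base64Jpeg) {
  img.src = `data:image/jpeg;base64,${base64Jpeg}`;
  video.style.visibility = "hidden";
}
```

Calling `mirrorStyles` on every frame, after the seek, lets the image pick up whatever the animation library just wrote to the video element.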

From the animation library's perspective, nothing has changed.

The element is in the same place with the same styles. It just happens to show a still image that changes every frame.

We don't let Chrome decode and schedule video. We hand it the exact frame we want, like a flipbook, and take a picture.

Others solve the same problem differently.

Remotion runs a long-running Rust compositor that decodes frames on demand and serves them over HTTP to its <OffthreadVideo> component. Replit demuxes frames in the browser with mp4box.js and decodes through WebCodecs, then paints into a <canvas>.

Ours is the simplest of the three: decode everything ahead of time in FFmpeg, serve JPEGs off disk. We trade flexibility (harder to handle blob URLs, streaming sources, dynamically-set src) for a much shorter pipeline.

There are still lots of improvements we can make here, but the base works great for our case.

The Other Determinism Traps

Controlling time and rendering gets you most of the way. It does not get you all the way.

Fonts. Most compositions use Google Fonts via @import url(fonts.googleapis.com/...). That call is a coin flip at render time.

The network might be fast, slow, or blocked. The font might load before or after your first frame.

To get rid of the variance, we rewrite every Google Fonts @import in the compiled HTML to point at a local, base64-embedded copy of the font from @fontsource. The composition renders exactly the same, minus the network round-trip and the flakiness.
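
The rewrite step could be sketched as a regex replace over the compiled HTML. The pattern and the `localCssFor` callback are assumptions: the callback stands in for whatever produces the locally embedded `@font-face` CSS, which is not shown here.

```javascript
// Matches @import rules that point at Google Fonts, with or without url(...) and quotes.
const GOOGLE_FONTS_IMPORT =
  /@import\s+(?:url\(\s*)?["']?([^"')\s]*fonts\.googleapis\.com[^"')\s]*)["']?\s*\)?\s*;/g;

// Replace each matching @import with locally supplied CSS,
// for example an @font-face rule whose src is a data: URI.
function inlineGoogleFonts(html, localCssFor) {
  return html.replace(GOOGLE_FONTS_IMPORT, (_match, url) => localCssFor(url));
}

// Illustrative usage with a placeholder font payload.
const input = `<style>@import url("https://fonts.googleapis.com/css2?family=Inter:wght@400;700&display=swap"); body { font-family: Inter; }</style>`;
const output = inlineGoogleFonts(
  input,
  () => `@font-face { font-family: "Inter"; src: url(data:font/woff2;base64,AAAA); }`
);
console.log(output.includes("fonts.googleapis.com")); // false
```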

Time quantization. A 30fps video has a frame every 33.3333ms.

If the renderer calls seek(0.0333333) for frame 1 and seek(0.0333334) for some edge-case path that recomputes the time, we want those to be the exact same frame. So every seek, in both preview and render, runs through a quantizer:

JavaScript function `quantizeTimeToFrame` quantizing time to the nearest frame.

It's one line. It turns out to matter. Without it, two code paths that compute the same nominal time in different ways can produce frames that differ by a pixel, and then you end up staring at a pixel-level diff wondering where it came from.
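
A quantizer of this kind can be as small as the sketch below. The function name comes from the text; the body is an assumption about how "nearest frame" is computed.

```javascript
// Snap a time in seconds to the nearest frame boundary.
function quantizeTimeToFrame(t, fps) {
  return Math.round(t * fps) / fps;
}

// Two slightly different computations of "frame 1" collapse to the same value.
console.log(quantizeTimeToFrame(0.0333333, 30) === quantizeTimeToFrame(0.0333334, 30)); // true
```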

Rules for the author. No Date.now() in composition code. No unseeded Math.random(). No network fetches at render time.

These are part of the contract. If you violate them, you get nondeterministic output even with everything else we just built.
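
For compositions that need randomness, a seeded generator keeps output repeatable. The sketch below uses a small mulberry32-style generator as one possible replacement for Math.random(); it is an illustration, not something HyperFrames is stated to provide.

```javascript
// Small seeded PRNG (mulberry32-style). Same seed, same sequence, every run.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Two generators with the same seed agree on every value.
const randA = seededRandom(42);
const randB = seededRandom(42);
console.log(randA() === randB()); // true
```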

What You Get

The same window.__hf runtime bundle runs in the studio preview (inside an iframe) and in the headless render.

The renderer verifies a sha256 of the bundle against a manifest before starting, which means "what you see in the preview" is literally the same code that produced your MP4.

Preview and render parity isn't hoped for. It's enforced.

For longer videos, rendering gets split across N Chrome processes.

Each worker renders its share of frames, and FFmpeg concatenates the per-worker MP4 chunks at the end.
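
One simple way to divide the work is contiguous frame ranges, one per worker, as sketched below. `splitFrames` is a hypothetical helper; the actual partitioning strategy is not specified above.

```javascript
// Split [0, totalFrames) into contiguous ranges, one per worker.
function splitFrames(totalFrames, workers) {
  const ranges = [];
  const base = Math.floor(totalFrames / workers);
  const extra = totalFrames % workers; // the first `extra` workers take one more frame
  let start = 0;
  for (let w = 0; w < workers; w++) {
    const count = base + (w < extra ? 1 : 0);
    if (count > 0) ranges.push({ start, end: start + count }); // end is exclusive
    start += count;
  }
  return ranges;
}

console.log(splitFrames(10, 3));
// [ { start: 0, end: 4 }, { start: 4, end: 7 }, { start: 7, end: 10 } ]
```

Contiguous ranges keep each worker's output a single playable chunk, which suits concatenation at the end.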

The one gotcha: video-heavy compositions can time out in parallel mode because Chrome can't seek multiple <video> elements simultaneously without running out of decoders.

The fix is to drop back to a single worker for video-heavy renders. Not elegant, but honest.

We didn't invent any of this from nothing.

GSAP's timelines are already paused-and-seekable by design, which is why the adapter is three methods. Remotion proved years ago that HTML could be a video format if you built the authoring model right.

Replit and Vinlic's WebVideoCreator pioneered time virtualization and BeginFrame capture for arbitrary web content.

We took a different path, a constrained authoring model with a seekable runtime contract, but the underlying techniques rest on work other people did first.

Why We Built HyperFrames

We think this is how agents will make video, and we think agents should be able to communicate through video.

That's why we open sourced it.

We want others to help extend this to further possibilities. We welcome contributions to our adapter system.

And we look forward to seeing what products are built on top of HyperFrames.

To try it, just run:

A code snippet in a bash terminal showing the command `npx skills add heygen-com/hyperframes`.

and tell your agent:

“/hyperframes create me a video about XYZ”.

Repo: github.com/heygen-com/hyperframes.
