Auto-Optim Achieves 3× AI Model Inference Speedup

Abstract

Squeezing the last drops of speed out of a model for a specific GPU has always been specialist work: slow, manual, and never quite finished. Most teams cast to bf16, wrap the thing in torch.compile, watch it get faster, and stop. For our decoder that is only two-thirds of the way. The rest is a long tail of fiddly, model-specific decisions, exactly the kind of patient, measurable grind an agent is good at and people are not.

So we built a closed loop that runs the grind on its own: propose a change, measure it against a frozen gate, keep it or revert it, and never declare victory by itself. We are publishing the whole thing, the loop, the evaluator, and the rule files, at github.com/heygen-com/auto-model-optim, along with the optimized decoder it produced. Point it at your own model by swapping three files. We think a pass like this should become the default way a model ships to a device. (It builds on the lineage of Andrej Karpathy's autoresearch, a keep-or-revert loop for model training; here we point the same machinery at the opposite problem, inference.)

What actually happened

Starting from a 14.47s fp32 baseline, the agent did what everyone does first, and it worked: bf16 for the tensor cores, the memory layout that lets cuDNN reach its fast kernels, a CUDA graph, then torch.compile. A bit over 2× faster, down to 6.64s. Then it got stuck.

It sat at 6.64s and started fiddling: toggling compiler flags, trying autotune modes, nudging knobs that each moved the number by a fraction of a percent. Its own profiler had already explained why torch.compile was barely helping, the streaming decode's host-side bookkeeping kept shattering the compiled graph into fragments, so there was nothing large left to fuse. It kept reaching for knobs anyway.

We did not tell it the answer. We changed the rules. One line into its spec: a sub-1% change is noise; if your last two wins were both noise, stop tuning knobs and make a structural change against your own number-one profiled cost. You program the org, not the Python.

A few experiments later it made the move itself. It rewrote the decoder's caching so the whole block compiles as a single graph, a "break-free decoder." The compiler fused across the block at once, and the time dropped from 6.64s to 4.82s in a single step, far more than any one fix should buy. A little polish took it to 4.79s, a clean 3.02×.

Then it stopped, not because it ran out of budget, but because the profile said there was nothing left to take: 99.8% GPU utilization, almost all of it inside the convolution kernels themselves, the elementwise work fully fused. It had hit the hardware floor, and it knew it. Recognizing the floor and stopping is itself a result.

A line graph shows performance (time in seconds) across 19 experiments, starting at 14.5s (fp32), dropping to 8.4s (bf16), plateauing around 6.6s (with one quality gate failure), and finally decreasing to 4.79s for a 3.02x speedup.

The whole climb, for the record:

config	latency	speedup
fp32 eager (baseline)	14.47s	1.00×
bf16	8.42s	1.72×
+ CUDA graph + native-pad conv	7.42s	1.95×
+ break-free decoder (full fusion)	4.82s	3.00×
final (refined)	4.79s	3.02×

But did it cheat, and did the video survive?

Two ways this goes wrong: the agent games the benchmark, or it quietly degrades the picture. Both were off the table by construction. The quality gate is frozen and the agent cannot touch it; every candidate has to match the original fp32 video (decoded from a public CC-BY clip, so the benchmark uses no proprietary data) within tolerance or it is thrown out, however fast. The winner actually came out cleaner than the mid-run configs (PSNR 61.3 dB). And to be sure, a second agent with a fresh context and no stake in the outcome re-ran the whole thing, reproduced 4.758s, and combed the diff for tricks: no cached outputs, no special-cased input, no fast path with a slow fallback. Real compute on every call. Faster only counts if it is the same video.

Why bother with this model

The decoder is the last stage of a video pipeline: it turns the model's latents into the frames a user actually watches, and it runs on every one. The transformer upstream gets all the attention and most of the optimization (flash-attention, fused kernels, years of community work); the VAE gets left behind. Its cost scales with resolution, so at 1080p and especially 4K it is frequently the part that is actually slow. A real, recurring bill, paid by everyone who builds on the model, and almost nobody tunes it. A good target.

The trick: a game the agent can't cheat and can't quit

Tell an agent "make it faster, keep the quality" and it does one of two things: it games the number you measure, or it stops the moment the work looks done. So you remove both options. A frozen evaluator it cannot edit defines the only notion of success. A persistent log, every attempt, a leaderboard, a dead-end journal, plus a standing order never to conclude on its own, keeps it honest and keeps it moving. The agent edits exactly one file; the harness, not the agent, decides keep or revert and commits the keepers to git. The history becomes the research record.

Flowchart for an iterative code optimization process: `optimize.py` is edited, then passes through a `verify.py` quality gate and `bench.py` latency test. If changes pass verification and are faster, they are kept; otherwise, they are reverted, restarting the edit process. A harness, not the agent, makes the keep/revert decision.

We did not invent this shape. KernelBench, Sakana's CUDA Engineer, Meta's KernelAgent, Kevin-32B and others have pointed LLMs at GPU code, and Karpathy's autoresearch is the closest template (though it is a training loop; we aim the same machinery at inference). They all teach one lesson, Sakana the hard way, when a headline 3.13× collapsed to 1.49× once the benchmark-gaming was stripped out: the number you optimize has to be impossible to fake. Hence the frozen gate.

And notice what actually won here. Not a hand-written kernel: dtype, layout, killing graph breaks, and letting the compiler fuse. This is deployment engineering across the whole config, not kernel golf, which is exactly why it transfers to other models and other GPUs.

How it's wired, in Claude Code

If you want to build your own, here is the mapping from idea to files.

The rules live in two always-loaded files: AGENTS.md, the short contract (explicit always / ask-first / never), and program.md, the full runbook, loaded on demand as a slash command.
The driver is /goal, a finish condition Claude Code re-checks every turn. Do not cap it at a number of experiments; the right stop is a plateau plus a profile that shows no headroom left.
A scoped subagent does the editing, restricted to optimize.py. A Stop hook refuses to end a turn until the experiment is logged and the gate is green. And every kept result auto-runs a detailed profiler, so the next idea comes from where the time actually goes, not a guess.

Diagram showing files grouped into 'Mutable' (optimize.py), 'Frozen Read-Only' (e.g., goals.json, verify.py), and 'Durable Record' (e.g., results.tsv, PROGRESS.md).

The single most important thing we did during the run was not a code hint. It was that one-line edit to program.md when the agent started nibbling. When your agent misbehaves, fix the rules, not the code. That edit is now baked into the published spec.

And that spec is the real deliverable. program.md is not ours alone: it distills the field's best practices into one file, Karpathy's keep-or-revert and never-stop, the kernel agents' rule that the verifier is adversarial-grade infrastructure, Meta's structured reflection, Anthropic's guidance for long-horizon agents, and then layers on what we learned the hard way, a pre-seeded dead-end log, an output-parity gate built for inference, and the anti-nibbling rule from the story above. The best of the field, plus our own scar tissue, in a file you can copy.

Get the kit, and the optimized decoder

Everything is open at github.com/heygen-com/auto-model-optim. Two files carry the whole method: program.md and AGENTS.md. The repo also ships the result, the winning optimize.py for the Wan 2.2 decoder, so if you run a Wan-2.2 pipeline you can drop in the optimized decode and take the 3.02× today. One caveat by design: it is tuned for a single H100. On other hardware, treat it as a starting point and rerun the loop.

The fastest way to retarget is to let Claude Code do the wiring: open the repo and ask the agent to point it at your model (nothing is hardcoded to a VAE; the harness is generic and the model-optimizer subagent is model-agnostic). By hand: drop your model and one fixed input into model/ and assets/, return your forward pass from optimize.py, and define your metric and gate in goals.json (decode latency and output-parity for us; tokens per second and an equivalence bound for an LLM; step latency and FID/LPIPS for a diffusion model). One rule first, whatever the model: pin a random seed, or run-to-run noise will masquerade as a speedup and the keep-or-revert call becomes meaningless. Then:

Plain text

huggingface-cli download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth --local-dir ./assets
export CUDA_VISIBLE_DEVICES=0
python harness/make_latent.py               # downloads a CC-BY clip, encodes the fixed input
python harness/make_reference.py            # fp32 decode of it = the frozen quality target
python harness/run_experiment.py --desc "baseline"
# then, in Claude Code, set the /goal and hand it the wheel

The bet

The division of labor is the whole point. We wrote the rules, the metric, and the gate. The agent supplied the patience, the part we find tedious and it does not, and made the one creative leap once we sharpened the spec. The expensive thing, the 105-experiment slog that taught us those rules, is done once and lives in a file now; the next model inherits it for the price of a few GPU-hours.

Our bet: within a couple of years, shipping a model to a device without a pass like this will look as careless as shipping code with no tests. The kit is open. Point it at your model and let it run.

References

Wan 2.2 (model + VAE): GitHub · HuggingFace (Apache-2.0)
This harness (auto_model_optim): github.com/heygen-com/auto-model-optim
KernelBench (Stanford): repo · arXiv 2502.10517
Sakana AI CUDA Engineer / robust-kbench: arXiv 2509.14279 · blog
Meta KernelAgent: repo · blog
Kevin-32B (Cognition): blog · arXiv 2507.11948
GPU MODE / KernelBot: repo · AlphaEvolve arXiv 2506.13131 · OpenEvolve repo
Karpathy autoresearch: repo · Anthropic, "Effective harnesses for long-running agents", essay

Auto-Optim: A Blind Agent Matched Our Hard-Won 3x, With No human in the Loop