Behind every AI product is a compute platform. At HeyGen, that platform powers model training, data processing, and production inference across every model we ship. It coordinates over 5,000 GPUs across multiple cloud providers, supporting workloads including large-scale distributed training, low-latency production inference, and massive data processing pipelines—each with different requirements for scheduling, networking, storage, and fault tolerance.
Most companies at this scale cobble together separate systems for each workload—a training cluster here, an inference fleet there, a data pipeline on yet another set of machines. We took a different approach. HELIOS (HeyGen Engine for Large-scale Infrastructure Orchestration Service) is a single unified platform that manages all GPU workloads through one control plane, one scheduling system, and one set of operational tooling. Every model HeyGen trains and serves runs on HELIOS.
Multi-Cloud Orchestration
GPU availability is the most constrained resource in AI infrastructure today. No single cloud provider can reliably supply the volume and variety of GPUs needed. H100s might be available on one provider while H200s are cheaper on another. Spot pricing fluctuates hourly. Entire regions can run out of capacity during peak demand.
HELIOS abstracts away the cloud provider layer. Workloads are defined in terms of their resource requirements—GPU type, memory, networking, storage—and the scheduler places them on the best available hardware across all providers. Today, the platform manages resources across 5+ providers, 10+ regions, and 15+ standardized cells, supporting reserved, on-demand, and preemptible capacity under one system.
Standardized Onboarding
Before HELIOS, onboarding a new GPU provider or region required repeating the same engineering work—adapting networking, storage, cluster management, monitoring, and operational workflows—for each new environment. HELIOS replaces this with a standard onboarding path: common validation, acceptance checks, and baseline infrastructure setup before a new environment joins the platform. Once admitted, resources are exposed through the same management model as the rest of the fleet. This reduced average onboarding time from two weeks to three days.
Cell-Based Architecture
Rather than building one monolithic cluster, HELIOS organizes the fleet into standardized cells, typically aligned by provider and region. Each cell is a Kubernetes cluster with a validated size boundary and a common operational baseline. This design limits blast radius: a problem in one cell is far less likely to propagate fleet-wide. It also provides a clean growth path—capacity is added through new standard cells rather than stretching a single control plane.
Training workloads run within a single cell to maximize interconnect bandwidth. A distributed training job that needs 64 GPUs gets placed on machines within the same cell, ensuring gradient synchronization happens over high-speed InfiniBand or NVLink rather than cross-datacenter networking. Inference workloads are more flexible—individual requests need only 1–8 GPUs and can run on any cell with available capacity. Data processing is the most flexible of all, designed to tolerate high latency between stages, making it an ideal backfill workload for otherwise idle GPU capacity.
Two-Stage QoS-Aware Scheduling
Inference workloads require higher priority and faster response. Training workloads need larger, more stable allocations over longer periods. Data processing can tolerate interruption. Treating all workloads identically would either waste expensive capacity or create contention for critical services.
HELIOS uses a two-stage scheduling model: a global scheduler makes capacity decisions based on workload QoS class, GPU type, request size, and supply model, while the selected cell handles local deployment and placement. This improved overall GPU utilization by 15% and reduced non-productive GPU time by approximately 20%.
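A compressed sketch of stage one, with hypothetical request/cell shapes: the global scheduler orders pending requests by QoS class and assigns each to a cell that has the right GPU type and quantity; the chosen cell then handles local placement (stubbed out here).

```python
# Lower number = scheduled first. Classes are from the article; the exact
# ordering policy here is an illustrative assumption.
QOS_PRIORITY = {"inference": 0, "training": 1, "data-processing": 2}

def schedule(pending: list[dict], cells: dict[str, dict]) -> list[tuple]:
    """Stage 1: global capacity decisions by QoS, GPU type, and size.
    Stage 2 (placement inside the chosen cell) happens cell-locally."""
    placements = []
    for req in sorted(pending, key=lambda r: QOS_PRIORITY[r["qos"]]):
        cell = next((name for name, free in cells.items()
                     if free.get(req["gpu_type"], 0) >= req["gpus"]), None)
        if cell is not None:
            cells[cell][req["gpu_type"]] -= req["gpus"]  # reserve capacity
            placements.append((req["name"], cell))
    return placements
```

Splitting the decision this way keeps the global scheduler small (it reasons only about aggregate capacity) while each cell retains authority over node-level details.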
Continuous Resource Governance
The platform continuously monitors health signals across the fleet, including GPU, PCIe, and NCCL-related conditions. Unhealthy nodes are automatically isolated and routed through recovery workflows. It also detects long-idle or low-utilization resources by combining GPU utilization, memory usage, task state, and runtime progress signals, reclaiming and reallocating capacity according to workload priority.
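The value of combining signals is that no single one is trustworthy on its own: a job loading a checkpoint can show 0% GPU utilization while making real progress. A minimal sketch of that idea, with invented field names and thresholds:

```python
def is_reclaimable(node: dict, idle_threshold_s: int = 3600) -> bool:
    """Reclaim only when every signal agrees the node is idle.
    Any one signal alone would produce false positives."""
    return (node["gpu_util"] < 0.05                      # GPU compute idle
            and node["gpu_mem_used_frac"] < 0.10         # no model resident
            and node["task_state"] == "idle"             # scheduler agrees
            and node["seconds_since_progress"] > idle_threshold_s)
```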
Unified Observability
HELIOS collects signals from infrastructure, clusters, workloads, and applications, applying different sampling and retention strategies depending on the use case. Beyond standard metrics, traces, and logs, the platform adds finer-grained network-side observability on key nodes to identify communication bottlenecks and performance jitter in training and inference scenarios. This shared operational view shortens debugging loops, improves capacity planning, and makes cost attribution tractable across teams.
Data Processing Engine
HELIOS manages the fleet. But the largest consumer of that fleet—the video data processing pipelines that feed training for all of HeyGen’s models—needed its own purpose-built engine. These pipelines originally ran on Ray, the standard choice for distributed Python workloads. Ray worked well initially, but as demand surged to 100K+ concurrent tasks across 2,000+ nodes, fundamental scalability limits emerged.
Four constraints define the operating environment:
- Heterogeneous pipeline stages. A video processing DAG mixes IO-bound media decode, GPU-bound model inference, and CPU-bound encoding. The scheduler must be resource-profile-aware.
- Priority-based scheduling. Latency-critical pipelines require preemptive priority; background pipelines must yield immediately when higher-priority work arrives.
- GPU fragmentation. When a stage needs 4 GPUs but available capacity is scattered as single-GPU slots across nodes, those GPUs are effectively stranded.
- Constant node preemption. Data processing is colocated with production inference, so machines can be reclaimed at any time. Node failure is a continuous operating condition.
Under sustained load, Ray’s Global Control Store (GCS) became the bottleneck:
- 100 GB+ RSS memory consumption
- 400% CPU utilization
- Quadratic scaling — every state change broadcast to every node
- Process crashes that took down the entire coordination layer
After extensive stabilization attempts (parameter tuning, cache capping, timeout reduction, telemetry disabling), each yielding only marginal improvement, we concluded we had outgrown Ray’s architecture.
Declarative Reconciliation Architecture
The core design decision is a declarative engine, drawing from the same principles behind Kubernetes. Instead of issuing imperative commands (“start actor X on node Y”), the control plane declares desired state in a distributed key-value store. Nodes independently converge local state toward the declared target through an observe → diff → reconcile loop, eliminating the command-dispatch, acknowledgment, retry, and rollback machinery entirely.

Figure 1: Command model vs. declarative model. In the command model (left), the control plane sends instructions directly to nodes, leading to lost commands and accumulated queues. In the declarative model (right), the control plane publishes desired state to a KV store, and nodes independently observe, diff, and reconcile—yielding idempotent, crash-safe operation.
This design is particularly well-suited to the workload: actors load large models onto GPUs with initialization costs measured in seconds to minutes. Once running, actors should remain in place to amortize that cost. The system optimizes for placement stability, not churn.
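The heart of the model is small enough to show directly. This is a simplified single pass of the loop each node agent runs (the real version reads from the KV store and manages OS processes; here both states are plain dicts):

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> tuple:
    """One observe -> diff -> reconcile pass for a node agent.
    Returns (to_start, to_stop): worker ids to spawn and to kill.
    Idempotent by construction: once actual matches desired, repeated
    passes produce no actions, so crashes and restarts are safe."""
    to_start = [wid for wid in desired if wid not in actual]
    to_stop = [wid for wid in actual if wid not in desired]
    return to_start, to_stop
```

Because the diff compares states rather than replaying commands, a lost message or a restarted agent costs nothing: the next pass simply observes reality and closes whatever gap remains.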
The engine is organized into four layers, each with a single responsibility:

Figure 2: Four-layer architecture of the data processing engine. The control plane (scheduler and pipeline controllers) publishes desired state to a distributed KV store. Node agents observe state changes and reconcile local workers accordingly. Each layer can be independently restarted without affecting running pipelines.
- Layer 1 — Scheduler. Reads demand, metrics, and capacity from the KV store and publishes placement assignments. GPU-bound stages are packed densely to minimize fragmentation; IO/CPU-bound stages are spread for throughput balance.
- Layer 2 — Pipeline Controllers. One controller per pipeline declares demand for its stages, wires the pipeline DAG, and manages lifecycle. Controllers are fault-isolated: a crash affects only its pipeline, and restart reconverges idempotently from the KV store.
- Layer 3 — Node Agents. One agent per worker node runs the core reconciliation loop—read desired state, compare with actual processes, spawn or kill workers to close the gap—and reports status periodically.
- Layer 4 — Workers. Stateless, single-purpose, disposable processes. Each hosts one task instance, pulls work from a queue, and writes output. Zero coordination logic, no awareness of peers or topology. Scale-out and crash replacement use the same code path.
Results

Table 1: Operational improvements after replacing Ray with the custom declarative engine.
Priority-aware scheduling, hardware-aware bin-packing, and dynamic capacity reallocation eliminate GPU fragmentation. Background pipelines absorb every idle cycle that inference and latency-critical pipelines leave behind. Any transient failure—scheduler restart, KV failover, or network partition—does not affect running tasks. Workers continue processing, node agents continue reconciling, and when the control plane recovers, it reads current state and resumes without losing in-flight work.
Why This Matters
Infrastructure is rarely the headline, but it determines what is possible. HELIOS lets HeyGen’s engineering teams treat 5,000+ GPUs across multiple clouds as a single pool, schedule three fundamentally different workloads through one control plane, and recover from failures in seconds rather than minutes. The declarative data processing engine replaced hundreds of lines of actor code with configuration, cut operational incidents from hours to minutes, and achieved linear scalability where the previous architecture hit a hard wall. These are the systems that make HeyGen’s products possible at scale—and every model that comes next.