Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Yishun Lu, Junhao Zhang, Zeyu Yang, Wes Armour
Second-order optimizers like SOAP and KL-Shampoo converge in fewer steps than Adam, but the cost of storing and updating large preconditioner matrices on the GPU has made them impractical at scale. Asteria dynamically offloads optimizer state across GPU, CPU, and NVMe tiers, runs inverse-root computations asynchronously on the host CPU while the GPU continues the forward/backward pass, and uses a bounded-staleness protocol to reduce synchronization overhead in multi-node setups. On a single GB10 GPU with 128GB unified memory, this enables second-order training of a 1B model; on multi-node GH200 clusters, it reduces latency spikes and improves wall-clock convergence on 7B models without sacrificing optimizer quality.
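The asynchronous inverse-root refresh with a staleness bound can be sketched roughly as follows. This is a minimal illustration and not Asteria's code: the `StalePreconditioner` class, the `max_staleness` parameter, the one-sided Shampoo-style statistic `G @ G.T`, and the use of a NumPy thread in place of a real host/GPU split are all assumptions made for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def inverse_root(mat, power=4, eps=1e-6):
    """Return mat^(-1/power) for a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.maximum(w, 0.0) + eps  # clamp round-off negatives, then damp
    return (v * w ** (-1.0 / power)) @ v.T


class StalePreconditioner:
    """Hypothetical helper: refreshes the inverse root in a background thread
    and blocks only when the root in use would otherwise be too old."""

    def __init__(self, dim, max_staleness=4):
        self.stats = np.zeros((dim, dim))  # running sum of G @ G.T
        self.inv_root = np.eye(dim)        # root currently applied to gradients
        self.version = 0                   # step whose snapshot produced inv_root
        self.max_staleness = max_staleness
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None               # (future, snapshot_step) or None

    def step(self, step_idx, grad):
        """Accumulate statistics and return the preconditioned gradient.
        Assumes step_idx counts 1, 2, 3, ... with one call per step."""
        self.stats += grad @ grad.T
        if self._pending is not None:
            future, snapshot_step = self._pending
            too_stale = step_idx - self.version >= self.max_staleness
            if future.done() or too_stale:
                # result() returns at once if finished, otherwise it waits;
                # waiting happens only when the staleness bound is hit.
                self.inv_root = future.result()
                self.version = snapshot_step
                self._pending = None
        if self._pending is None:
            shadow = self.stats.copy()  # worker never sees later updates
            self._pending = (self._pool.submit(inverse_root, shadow), step_idx)
        return self.inv_root @ grad

    def close(self):
        self._pool.shutdown(wait=True)
```

In this sketch the training loop calls `step` once per iteration and pays for an eigendecomposition on the critical path only when the background job has fallen `max_staleness` steps behind; otherwise it keeps using the older root.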
No production traction yet. The GitHub references are all arXiv RSS scrapers and daily digest bots; none are implementing or extending Asteria. No citations registered on Semantic Scholar. The paper is very recent and the work appears to come from Oxford's systems group; there's no public code repo linked. Worth watching for teams running SOAP or distributed Shampoo variants who are hitting memory or synchronization walls, but nothing is shipping from this yet.
Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128GB unified memory, Asteria supports second-order training for a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
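To make the tiered placement idea concrete, here is a small sketch of one possible policy: a greedy pass that puts the most frequently accessed state blocks into the fastest tier with free capacity. The abstract does not describe Asteria's actual placement algorithm, so the `StateBlock` type, the `place_blocks` function, the access-frequency heuristic, and all sizes below are hypothetical.

```python
from dataclasses import dataclass

TIERS = ("gpu", "cpu", "nvme")  # ordered fastest to slowest


@dataclass(frozen=True)
class StateBlock:
    name: str
    size_gb: float
    accesses_per_step: float  # assumed proxy for how "hot" the block is


def place_blocks(blocks, capacity_gb):
    """Greedy placement: hottest blocks first, each into the fastest tier
    that still has room. Raises MemoryError if a block fits nowhere."""
    free = {tier: capacity_gb.get(tier, 0.0) for tier in TIERS}
    placement = {}
    for block in sorted(blocks, key=lambda b: b.accesses_per_step, reverse=True):
        for tier in TIERS:
            if block.size_gb <= free[tier]:
                placement[block.name] = tier
                free[tier] -= block.size_gb
                break
        else:
            raise MemoryError(f"no tier can hold {block.name}")
    return placement


# Illustrative numbers only; they are not taken from the paper.
blocks = [
    StateBlock("momentum", size_gb=8.0, accesses_per_step=1.0),
    StateBlock("inverse_roots", size_gb=20.0, accesses_per_step=1.0),
    StateBlock("kronecker_factors", size_gb=20.0, accesses_per_step=0.1),
]
placement = place_blocks(blocks, {"gpu": 10.0, "cpu": 24.0, "nvme": 100.0})
print(placement)
# {'momentum': 'gpu', 'inverse_roots': 'cpu', 'kronecker_factors': 'nvme'}
```

A runtime that reacts to memory pressure would rerun a policy like this as capacities change and migrate blocks between tiers; the sketch covers only the one-shot placement decision.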