💤 Quiet · score 64.8 · May 15, 2026 · 2605.16184 · cs.DC · cs.LG

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

Yishun Lu, Junhao Zhang, Zeyu Yang, Wes Armour

Narrative

Second-order optimizers like SOAP and KL-Shampoo converge faster than Adam in terms of steps, but the cost of storing and updating large preconditioner matrices on GPU has made them impractical at scale. Asteria offloads optimizer state across GPU/CPU/NVMe tiers dynamically, runs inverse-root computations asynchronously on the host CPU while the GPU stays on the forward/backward pass, and uses a bounded-staleness protocol to reduce synchronization overhead in multi-node setups. On a single GB10 GPU with 128GB unified memory, this enables second-order training of a 1B model; on multi-node GH200 clusters, it reduces latency spikes and improves wall-clock convergence on 7B models without sacrificing optimizer quality.
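The asynchronous refresh described above can be illustrated with a small single-process sketch. It is not Asteria's code or API: the class name, `max_staleness`, `refresh_fn`, and the elementwise inverse square root (a cheap stand-in for a matrix inverse-root) are all assumptions made for illustration, and the staleness bound here is applied to a local background thread rather than to multi-node synchronization. The training loop keeps using the old preconditioner while a worker thread computes the next one, and it blocks only when the bound is reached.

```python
import math
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def inverse_sqrt_diag(stats: Vector) -> Vector:
    """Hypothetical stand-in for an expensive inverse-root: elementwise 1/sqrt(x)."""
    time.sleep(0.01)  # simulate host-side work
    return [1.0 / math.sqrt(x) for x in stats]


class BoundedStalePreconditioner:
    """Serves an older preconditioner while a newer one is computed off-thread."""

    def __init__(self, refresh_fn: Callable[[Vector], Vector],
                 initial: Vector, max_staleness: int) -> None:
        self.refresh_fn = refresh_fn
        self.current = initial
        self.version_step = 0  # step whose statistics produced `current`
        self.max_staleness = max_staleness
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self._pending_step = 0

    def get(self, step: int, stats: Vector) -> Vector:
        if self._pending is not None:
            at_bound = step - self.version_step >= self.max_staleness
            if self._pending.done() or at_bound:
                # result() blocks only if the background job is still running.
                self.current = self._pending.result()
                self.version_step = self._pending_step
                self._pending = None
        if self._pending is None:
            # Copy the statistics so the worker never sees later mutations.
            self._pending = self._pool.submit(self.refresh_fn, list(stats))
            self._pending_step = step
        return self.current

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def run_demo(num_steps: int = 8, max_staleness: int = 3) -> Tuple[List[int], Vector]:
    pre = BoundedStalePreconditioner(inverse_sqrt_diag, [1.0, 1.0], max_staleness)
    stats = [4.0, 16.0]
    staleness: List[int] = []
    value: Vector = []
    for step in range(num_steps):
        value = pre.get(step, stats)
        staleness.append(step - pre.version_step)
    pre.close()
    return staleness, value
```

Calling `run_demo()` returns the per-step staleness and the last preconditioner served. How many steps each version survives depends on thread timing, but the staleness never exceeds `max_staleness`.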

No production traction yet. The GitHub references are all arXiv RSS scrapers and daily digest bots; none implement or extend Asteria. No citations are registered on Semantic Scholar. The paper is very recent, the work appears to come from Oxford's systems group, and no public code repo is linked. Worth watching for teams running SOAP or distributed Shampoo variants that are hitting memory or synchronization walls, but nothing is shipping from this yet.

Abstract

Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce Asteria, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, allowing expensive inverse-root computations to proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria on both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128GB unified memory, Asteria supports second-order training for a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo in a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
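To give a rough sense of why preconditioner state is spread across memory tiers, the sketch below sizes Shampoo-style Kronecker factors (one rows × rows and one cols × cols matrix per weight matrix) and places them greedily into tiers, fastest first. The greedy policy, the tier names, the capacities, and every function name are assumptions for illustration only; the abstract does not specify Asteria's placement policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LayerShape:
    name: str
    rows: int
    cols: int


def kronecker_factor_bytes(layer: LayerShape, bytes_per_elem: int = 4) -> int:
    """Bytes for one rows x rows and one cols x cols factor (fp32 by default)."""
    return (layer.rows ** 2 + layer.cols ** 2) * bytes_per_elem


def place_states(layers: List[LayerShape],
                 capacities: List[Tuple[str, int]]) -> Dict[str, str]:
    """Hypothetical greedy policy: first tier (fastest first) with enough room."""
    remaining = [[tier, cap] for tier, cap in capacities]
    placement: Dict[str, str] = {}
    for layer in layers:
        need = kronecker_factor_bytes(layer)
        for slot in remaining:
            if slot[1] >= need:
                slot[1] -= need
                placement[layer.name] = slot[0]
                break
        else:
            raise MemoryError(f"no tier can hold state for {layer.name}")
    return placement


MIB = 1024 ** 2
layers = [
    LayerShape("attn.0", 1024, 1024),  # 8 MiB of factors
    LayerShape("mlp.0", 1024, 4096),   # 68 MiB of factors
    LayerShape("attn.1", 1024, 1024),
    LayerShape("mlp.1", 1024, 4096),
]
tiers = [("gpu", 16 * MIB), ("cpu", 128 * MIB), ("nvme", 1024 * MIB)]
placement = place_states(layers, tiers)
```

With these made-up budgets, the two attention layers fit on the GPU tier, the first MLP layer lands in CPU memory, and the second spills to NVMe. A runtime that reacts to memory pressure would re-run placement as budgets change, which this static sketch does not attempt.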

Signal

Stars: 35
Repos: 10
Citations: 0
Velocity: 0.00/d

GitHub repos (18)

luohongk/Embodied-AI-Daily⭐ 246
“| **[Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems](https://arxiv.org/abs/2605.16198v1)** | 2026-05-15 | | | **[paper.json: A Coordination Convention for LLM-Agent-Actionable Papers](https://arxiv.org/abs/2605.16194v1)** |”
ehijano/rss_fetch⭐ 11
“ </item> <item> <title>Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training</title> <link>https://arxiv.org/abs/2605.16184</link> <description>arXiv:2605.16184v1 Announce Type: new Abstract: Second-order methods offer an attractive pa”
ZenAlexa/agi-brief-history⭐ 11
“- **Summary**: Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, ra”
lonePatient/lonePatient.github.io⭐ 9
“{% hideToggle Click to view abstract %} {% note blue no-icon %} ID-34-Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training {% endnote %} **Link**: https://arxiv.org/abs/2605.16184 **Authors**: Yishun Lu,Junhao Zhang,Zeyu Yang,Wes Armour **Category**: Distributed, Parallel, and Cluster Co”
jyyang621/DailyArXiv⭐ 8
“| **[Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs](https://arxiv.org/abs/2605.00674v2)** | 2026-05-15 | | | **[Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search](https://arxiv.org/abs/2605.16191v1)** | 2026-05-15 ”
MayDomine/arxiv_rss_bot⭐ 3
“ --- ### 7. [Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training](https://arxiv.org/abs/2605.16184) **Authors**: Yishun Lu, Junhao Zhang, Zeyu Yang, Wes Armour **Category**: cs.DC ”
mumupika/my-Arxiv-Daily⭐ 1
“| Title | Authors | Published | PDF | Abstract | |------|------|----------|-----|------| | [Designing Datacenter Power Delivery Hierarchies for the AI Era](https://arxiv.org/abs/2605.16255v1) | Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini | 2026-05-15 | [”
nx1/nx1.github.io⭐ 1
“<a href="https://arxiv.org/abs/2605.16142">Property-Guided LLM Program Synthesis for Planning</a> <a href="https://arxiv.org/abs/2605.16145">Skew-adaptive conformal prediction</a> <a href="https://arxiv.org/abs/2605.16163">SwAIther-Precip: Lead-Time-Aware Bias Correction Enables ”
shaijing/arxiv-paper⭐ 0
“| **[Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs](https://arxiv.org/abs/2605.00674v2)** | 2026-05-15 | <details><summary>Show</summary>Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static ben”
Time-has-wings/MLSys-Papers⭐ 0
“ --- ## 🏗️ AI Infrastructure\n\n### Asteria: A Runtime-Orchestrated Second-Order Optimization System for Scalable LLM Training\n- **English title**: Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training\n- **Authors**: Yishun Lu, Junhao Zhang, Zeyu Yang, Wes Armour\n- **arXiv**: [2605.16184v1](https://arxiv.org/abs/2605.16184v1) | [PDF](http”
windrise/windrise.github.io⭐ 0
“ ], "primary_category": "cs.DC", "links": { "paper": "http://arxiv.org/abs/2605.16184v1", "pdf": "https://arxiv.org/pdf/2605.16184v1" }, "arxiv_id": "2605.16184v1",”
xiuguangli/DailyArxiv⭐ 0
“ "date": "2026-05-18", "date_url": "https://arxiv.org/catchup/cs.LG/2026-05-18?abs=True", "arxiv_id": "2605.16184", "abs_url": "https://arxiv.org/abs/2605.16184", "pdf_url": "https://arxiv.org/pdf/2605.16184", "title": "Runtime-Orche”
Jack-Zhuang/ai-daily-report⭐ 0
“ | Large language models | <a href="https://arxiv.org/abs/2605.16184v1" target="_blank"> arXiv original</a> </div”
iamhenryhuang/Daily-AI-Paper-Digest⭐ 0
“| 2 | [Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems](http://arxiv.org/abs/2605.16198v1) | 2026-05-15 | cs.AI, cs.CY, cs.LG, cs.LO | +2 mentions top institution: mit; LLM rank #12 / 6/10: Combines formal methods with ML to enable auditing and”
Hdksg10/DailyArxiv⭐ 0
“| **[Designing Datacenter Power Delivery Hierarchies for the AI Era](https://arxiv.org/abs/2605.16255v1)** | 2026-05-15 | <details><summary>Show</summary>Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1MW per deployment by 202”
mghnasiri/PORID⭐ 0
“ { "title": "Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training", "authors": "Yishun Lu, Junhao Zhang, Zeyu Yang", "url": "http://arxiv.org/abs/2605.16184v1", "date": "2026-05-15" }, {”
mirae0708/steven⭐ 0
“ > **Source:** [arXiv](http://arxiv.org/abs/2605.16184v1) > **Category:** Artificial_Intelligence/AI_Agents”
pchaganti/pchaganti.github.io⭐ 0
“ { "title": "Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training", "summary": "Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining ”