🚀 Shipping · score 79.4 · May 15, 2026 · 2605.16165 · cs.CV · cs.AI

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu, Wes Armour

Narrative

The core claim is that second-order optimization (specifically SOAP) handles the gradient heterogeneity between vision and language modalities better than AdamW in unified autoregressive models like Janus and Emu3. The proposed ML-FOP-SOAP adds Fisher-Orthogonal Projection to suppress cross-modal variance conflicts and a hierarchical gradient folding strategy to make second-order preconditioning tractable at large batch sizes. Reported gains over AdamW are up to 1.4× in sample efficiency and up to 1.5× in wall-clock training speed, with stable training at batch size 8192. These are meaningful numbers if they hold across architectures, but the evaluation is limited to two models.

No production traction yet. The paper has zero citations, and every GitHub reference is an arXiv-tracking newsletter or daily-digest bot rather than an implementation or fork. The work is too recent to assess adoption, but the practical angle (fixing multimodal training instability with a drop-in optimizer) is the kind of thing that teams actively training unified image-text models pick up quickly if the results replicate.

Abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.
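
The abstract names a Fisher-Orthogonal Projection but does not spell out its formula, so the following is only a rough illustration of the general idea of projecting one gradient orthogonally to another under a Fisher metric. It is not the paper's algorithm. Everything specific here is an assumption made for the sketch: the diagonal Fisher approximation, the split into one image gradient and one text gradient, the mixing weight `alpha`, and all function names.

```python
from typing import List, Sequence

Vector = List[float]


def fisher_inner(u: Sequence[float], v: Sequence[float],
                 fisher_diag: Sequence[float]) -> float:
    """Inner product u^T F v, with F approximated by its diagonal (assumption)."""
    return sum(f * a * b for f, a, b in zip(fisher_diag, u, v))


def fisher_orthogonal_component(g_diff: Sequence[float], g_avg: Sequence[float],
                                fisher_diag: Sequence[float],
                                eps: float = 1e-12) -> Vector:
    """Remove from g_diff its component along g_avg, measured in the Fisher metric."""
    denom = fisher_inner(g_avg, g_avg, fisher_diag) + eps  # eps guards against g_avg == 0
    coeff = fisher_inner(g_diff, g_avg, fisher_diag) / denom
    return [d - coeff * a for d, a in zip(g_diff, g_avg)]


def combine_modalities(g_image: Sequence[float], g_text: Sequence[float],
                       fisher_diag: Sequence[float], alpha: float = 1.0) -> Vector:
    """Hypothetical combination rule: shared direction plus F-orthogonal disagreement."""
    g_avg = [(a + b) / 2.0 for a, b in zip(g_image, g_text)]   # shared descent direction
    g_diff = [(a - b) / 2.0 for a, b in zip(g_image, g_text)]  # cross-modal disagreement
    g_perp = fisher_orthogonal_component(g_diff, g_avg, fisher_diag)
    return [a + alpha * p for a, p in zip(g_avg, g_perp)]


if __name__ == "__main__":
    g_image, g_text, fisher = [1.0, 0.0], [0.0, 1.0], [2.0, 1.0]
    print(combine_modalities(g_image, g_text, fisher))
```

In this toy setup the disagreement term that is added back has, up to the `eps` guard, zero Fisher inner product with the averaged gradient, so it cannot cancel the shared direction under that metric. How the paper actually estimates the Fisher information, and how the projection interacts with SOAP's preconditioner and the hierarchical folding, is not described in the text above.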
