🚀 Shipping · score 81.3 · May 15, 2026 · 2605.16241 · cs.CV · cs.AI

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi, Brady Zhang, Yishun Lu

Narrative

Distilling a 7B-parameter VLA policy (OpenVLA) into a 158M-parameter student with offline semantic supervision during training (specifically, task phase labels and multi-frame directional descriptions generated by a VLM) leaves only a 0.27% average relative gap to the teacher across three LIBERO benchmark suites. At inference, neither the teacher nor the VLM is needed; the student runs at 12.5 Hz on an RTX 4090, a 3.28× speedup over OpenVLA-7B. The same pipeline transfers to a π0.5-4B teacher, where the student outperforms the teacher on two of the three suites and stays within 0.53% on libero_goal.

No production traction yet. The paper has zero citations, and every referencing repo is an arXiv digest aggregator rather than an implementation. The LIBERO results are simulation-only, and no released code or model weights are visible. Worth watching for robotics teams trying to deploy OpenVLA or π0 derivatives on edge hardware, but this is still a preprint with no external validation.

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
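
The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not anything from the paper: it assumes "7B" and "158M" are exact nominal parameter counts, that the 3.28× speedup was measured on the same RTX 4090 as the 12.5 Hz figure, and that the relative gap is defined as (teacher − student) / teacher. The function names are hypothetical.

```python
# Figures quoted in the abstract. Parameter counts are taken at face value
# (assumption: "7B" = 7.0e9 and "158M" = 158e6; true counts may differ).
TEACHER_PARAMS = 7e9
STUDENT_PARAMS = 158e6
STUDENT_HZ = 12.5   # student control rate on an RTX 4090
SPEEDUP = 3.28      # reported speedup over OpenVLA-7B
REL_GAP = 0.0027    # 0.27% average relative gap to the teacher


def size_reduction(teacher_params: float, student_params: float) -> float:
    """Ratio of teacher size to student size."""
    return teacher_params / student_params


def implied_teacher_hz(student_hz: float, speedup: float) -> float:
    """Teacher control rate implied by the student rate and the speedup.

    Assumes both rates were measured on the same hardware.
    """
    return student_hz / speedup


def retained_fraction(rel_gap: float) -> float:
    """Student success as a fraction of teacher success.

    Assumes rel_gap = (teacher - student) / teacher.
    """
    return 1.0 - rel_gap


if __name__ == "__main__":
    print(f"size reduction:     {size_reduction(TEACHER_PARAMS, STUDENT_PARAMS):.1f}x")
    print(f"implied teacher Hz: {implied_teacher_hz(STUDENT_HZ, SPEEDUP):.2f}")
    print(f"retained fraction:  {retained_fraction(REL_GAP):.4f}")
```

Under these assumptions the size ratio rounds to the reported 44×, and the implied teacher rate is roughly 3.8 Hz, which gives a sense of why the teacher is hard to use for real-time closed-loop control.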
