🚀 Shipping · score 81.3 · May 15, 2026 · 2605.16241 · cs.CV · cs.AI

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi, Brady Zhang, Yishun Lu

Narrative

Distilling a 7B-parameter VLA policy (OpenVLA) into a 158M-parameter student with offline semantic supervision during training (task phase labels and multi-frame directional descriptions generated by a VLM) retains 99.73% of the teacher's performance on LIBERO benchmarks, a 0.27% average relative gap. At inference, neither the teacher nor the VLM is needed; the student runs at 12.5 Hz on an RTX 4090, a 3.28× speedup over OpenVLA-7B. The same pipeline transfers to a π0.5-4B teacher, where the student outperforms the teacher on two of the three benchmark suites and stays within 0.53% on libero_goal.
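
The headline ratios can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the summary. The teacher size is taken as a nominal 7e9 parameters, which is an assumption, and the teacher's control rate is derived from the reported speedup and is not a number the paper reports.

```python
# Figures quoted in the summary; derived values are back-of-envelope only.
TEACHER_PARAMS = 7e9      # assumption: nominal "7B" count for OpenVLA-7B
STUDENT_PARAMS = 158e6    # reported student size
STUDENT_HZ = 12.5         # reported control rate on an RTX 4090
SPEEDUP = 3.28            # reported speedup over OpenVLA-7B
RELATIVE_GAP = 0.0027     # reported average relative gap (0.27%)

size_reduction = TEACHER_PARAMS / STUDENT_PARAMS   # parameter-count ratio
student_latency_ms = 1000.0 / STUDENT_HZ           # time budget per control step
implied_teacher_hz = STUDENT_HZ / SPEEDUP          # derived here, not reported
retained_fraction = 1.0 - RELATIVE_GAP             # share of teacher performance kept

print(f"size reduction      : {size_reduction:.1f}x")
print(f"student step latency: {student_latency_ms:.0f} ms")
print(f"implied teacher rate: {implied_teacher_hz:.2f} Hz")
print(f"retained performance: {retained_fraction:.2%}")
```

Under these assumptions the size ratio comes out close to the reported 44×, and a 12.5 Hz control rate corresponds to an 80 ms budget per step.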

No production traction yet. There are zero citations, and every referencing repo is an arXiv digest aggregator, not an implementation. The LIBERO results are simulation-only, and no released code or model weights are visible. Worth watching for robotics teams trying to deploy OpenVLA or π0 derivatives on edge hardware, but this is still a preprint with no external validation.

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
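
The abstract describes a training objective that combines action imitation with auxiliary semantic signals. The abstract does not give the loss, so what follows is a minimal sketch under stated assumptions: the phase anchors and direction descriptions are treated as discrete class labels, the student is given two hypothetical auxiliary heads, and the weights `w_phase` and `w_dir` are placeholders. None of these names or values come from the paper.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def distillation_loss(student_action, teacher_action,
                      phase_logits, phase_label,
                      direction_logits, direction_label,
                      w_phase=0.1, w_dir=0.1):
    """Action imitation plus two auxiliary semantic terms.

    Assumption: phase and direction supervision are class labels produced
    offline by a VLM; the weights are illustrative, not from the paper.
    """
    action_term = mse(student_action, teacher_action)        # 7-DoF imitation
    phase_term = cross_entropy(phase_logits, phase_label)    # task phase anchor
    dir_term = cross_entropy(direction_logits, direction_label)  # motion direction
    return action_term + w_phase * phase_term + w_dir * dir_term

# Hypothetical single training example: 7-DoF actions, 3 phases, 6 directions.
loss = distillation_loss(
    student_action=[0.1, 0.0, -0.2, 0.0, 0.0, 0.05, 1.0],
    teacher_action=[0.1, 0.1, -0.2, 0.0, 0.0, 0.00, 1.0],
    phase_logits=[2.0, 0.5, -1.0], phase_label=0,
    direction_logits=[0.0, 1.5, 0.0, 0.0, 0.0, 0.0], direction_label=1,
)
print(f"combined loss: {loss:.4f}")
```

Because the auxiliary heads only contribute to the training loss, they can be dropped at inference, which is consistent with the claim that the student runs without the teacher or the VLM.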

Signal

Stars: 220
Repos: 10
Citations: 0
Velocity: 0.00/d

GitHub repos (14)

Tavish9/awesome-daily-AI-arxiv⭐ 92
“ Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into comp”
Owen-Liuyuxuan/papers_reading_sharing.github.io⭐ 59
“<a id='2605.16241v1'></a> ## [Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation](https://arxiv.org/abs/2605.16241v1) ”
CSQianDong/Awesome-arXiv-Daily-Reporter⭐ 47
“{'arxiv_id': 'arXiv:2605.16255', 'title': 'Designing Datacenter Power Delivery Hierarchies for the AI Era', 'authors': 'Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini', 'link': 'https://arxiv.org/abs/2605.16255', 'abstract': 'Demand for A”
Ed1sonChen/DailyArxiv⭐ 38
“| --- | --- | --- | --- | | **[Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation](https://arxiv.org/abs/2605.16241v1)** | 2026-05-15 | <details><summary>Show</summary><p>Billion-parameter Vision-Language-Action (VLA) policies have recently shown i”
lonePatient/lonePatient.github.io⭐ 9
“{% hideToggle Click to view abstract %} {% note blue no-icon %} ID-6-Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation {% endnote %} **Link**: https://arxiv.org/abs/2605.16241 **Authors**: Jin Shi,Brady Zhang,Yishun Lu **Category**: Computer Vision and Pattern Recognition (c”
2shin0/arxiv-ai-mailing⭐ 6
“ ## 36. Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation - **Authors**: Jin Shi , Brady Zhang , Yishun Lu - **URL**: [https://arxiv.org/abs/2605.16241](https://arxiv.org/abs/2605.16241) - **Abstract**: > Billion-parameter Vision-Language-Action (”
ttmens/ai-radar-wiki⭐ 5
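“VLA-AD is an efficient distillation framework for vision-language-action policies that uses offline semantic guidance to compress billion-parameter models into lightweight ones, enabling real-time closed-loop robot control. The technique significantly reduces inference cost and latency while maintaining task performance, offering a viable path to commercial deployment of embodied AI. The product innovation lies in using the semantic knowledge of pretrained large models to guide small-model learning without online interaction, which suits resource-constrained robotics scenarios. ## Links - 📄 arXiv: http://arxiv.org/abs/2605.16241v1 ## PM perspective > Added after Stage 2 LLM analysis”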
ValoraY/arXiv-daily⭐ 1
“<hr /> <h4 id="abstract_17">📄 Abstract</h4> <p>Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study \textit{line tracing}, where a model must follow a selected visual path thro”
Infinity4B/daily-arxiv-vla⭐ 1
“ <div class="detail-hero-copy"> <p class="eyebrow">Paper details</p> <h1 class="detail-page-title">Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation</h1> <div class="detail-meta">2026-05-15 · <a href="http://arx”
xiuguangli/DailyArxiv⭐ 0
“ "date": "2026-05-18", "date_url": "https://arxiv.org/catchup/cs.CV/2026-05-18?abs=True", "arxiv_id": "2605.16241", "abs_url": "https://arxiv.org/abs/2605.16241", "pdf_url": "https://arxiv.org/pdf/2605.16241", "title": "Offline Seman”
mickdur/tech-watch⭐ 0
“ "https://arxiv.org/abs/2605.16233": "2026-05-18T07:51:44.206446+00:00", "https://arxiv.org/abs/2605.16234": "2026-05-18T07:51:44.206446+00:00", "https://arxiv.org/abs/2605.16238": "2026-05-18T07:51:44.206446+00:00", "https://arxiv.org/abs/2605.16241": "2026-05-18T07:51:44”
yangbc2015/yangbc2015.github.io⭐ 0
“pdf_url: https://arxiv.org/pdf/2605.16241v1.pdf arxiv_url: https://arxiv.org/abs/2605.16241v1 tags:”
sirichen2/sirichen2.github.io⭐ 0
“ "Brady Zhang", "Yishun Lu" ], "abs_url": "https://arxiv.org/abs/2605.16241v1", "pdf_url": "https://arxiv.org/pdf/2605.16241v1", "published": "2026-05-15T17:48:25+00:00", "updated": "2026-05-15T17:48:25+00:00",”
bspiegel27/bst_236_website⭐ 0
“ <entry> <id>http://arxiv.org/abs/2605.16241v1</id> <title>Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation</title>”