🚀 Shipping · score 110.0 · May 26, 2026 · 2605.27365 · cs.CV, cs.AI, cs.LG, cs.RO

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
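The contrast the abstract draws between coordinate-token serialization and decoding a box as one atomic unit can be illustrated with a toy step count. This is a minimal sketch under stated assumptions, not the paper's implementation: the bin count `NUM_BINS`, the `(x1, y1, x2, y2)` layout, and all function names are hypothetical, and the "parallel" path only counts steps rather than modeling the actual decoder. An IoU helper is included because the abstract reports results in terms of high-IoU localization quality.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) with coordinates normalized to [0, 1] (assumed layout)
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # assumed quantization resolution; not taken from the paper


def box_to_tokens(box: Box, num_bins: int = NUM_BINS) -> List[int]:
    """Serialize one 2D box into four 1D coordinate-bin tokens."""
    return [min(num_bins - 1, max(0, int(c * num_bins))) for c in box]


def sequential_steps(boxes: List[Box]) -> int:
    """Token-by-token decoding: one decoding step per coordinate token."""
    return sum(len(box_to_tokens(b)) for b in boxes)


def parallel_steps(boxes: List[Box]) -> int:
    """Box-as-atomic-unit decoding: one decoding step per box."""
    return len(boxes)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)]
    print(box_to_tokens(boxes[1]))                         # [250, 250, 750, 750]
    print(sequential_steps(boxes), parallel_steps(boxes))  # 8 2
    print(iou(boxes[0], boxes[1]))
```

In this toy setting the sequential path needs four steps per box while the atomic path needs one, which is the source of the parallelism the abstract describes. The real throughput gain depends on the model and is not estimated here.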

Signal

Stars: 239
Repos: 51
Citations: 0
Velocity: 0.00/d

GitHub repos (20)

Tavish9/awesome-daily-AI-arxiv⭐ 97
“ Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict acc”
Ed1sonChen/DailyArxiv⭐ 51
“## Vision Language Model | **Title** | **Date** | **Abstract** | **Comment** | | --- | --- | --- | --- | | **[LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://arxiv.org/abs/2605.27365v1)** | 2026-05-26 | <details><summary>Show</s”
wwd29/arxiv-daily⭐ 21
“<ul> <li>Authors: Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu</a></li> <li>Subjects: cs.CV, cs.AI, cs.LG, cs.RO</a></li> <li>Abs”
ZenAlexa/agi-brief-history⭐ 11
“- **Summary**: User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market researc”
ehijano/rss_fetch⭐ 11
“ </item> <item> <title>LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding</title> <link>https://arxiv.org/abs/2605.27365</link> <description>arXiv:2605.27365v1 Announce Type: cross Abstract: Vision-language models ”
lonePatient/lonePatient.github.io⭐ 9
“{% hideToggle Click to view abstract %} {% note blue no-icon %} ID-4-LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding {% endnote %} **Link**: https://arxiv.org/abs/2605.27365 **Authors**: Shihao Wang,Shilong Liu,Yuanguo Kuang,Xinyu Wei,Yangzhou Liu,Zhiqi Li,Yu”
sifted-network/sifted-awesome-ai-agents⭐ 7
“ arXiv:2605.27365v1 Announce Type: cross Abstract: Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This to”
2shin0/arxiv-ai-mailing⭐ 7
“ ## 39. LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding - **Authors**: Shihao Wang , Shilong Liu , Yuanguo Kuang , Xinyu Wei , Yangzhou Liu , Zhiqi Li , Yunze Man , Guo Chen , Andrew Tao , Guilin Liu , Jan Kautz , Lei Zhang , Zhiding Yu ”
ttmens/ai-radar-wiki⭐ 6
“ </item> <item> <title>LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Bo</title> <link>http://arxiv.org/abs/2605.27365v1</link> <guid isPermaLink="false">locateanything-fast-and-high-quality-vision-language-grounding-with-pa”
AtomChen0425/Global_Trends⭐ 4
“</details> --- ### 3. [LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://arxiv.org/abs/2605.27365v1) 👤 **Authors:** Shihao Wang, Shilong Liu, Yuanguo Kuang <details> <summary>📄 Paper abstract: **Background**”
bailynlove/Rookie-s-Newsletters⭐ 3
“ <td>P2</td> </tr> <tr> <td><a href="https://arxiv.org/abs/2605.27365">LocateAnything: Vision-Language Grounding</a></td> <td>cs.CV</td> <td>05-26</td> <td>P2</td>”
tryigit/cleveres-ai⭐ 2
“**Category:** Frontier / Research Paper **Paper:** [LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://arxiv.org/abs/2605.27365) **Date:** May 2026”
ValoraY/arXiv-daily⭐ 1
“{"id": "2605.27295", "categories": ["cs.CV"], "pdf": "https://arxiv.org/pdf/2605.27295", "abs": "https://arxiv.org/abs/2605.27295", "authors": ["Madhuri Shanbhogue", "Zhe Li", "Shanfeng Zhang", "Gustavo Hernández Ábrego", "Shih-Cheng Huang", "Aashi Jain", "Daniel Salz", "Sonam Go”
quanyushi/DailyArxiv⭐ 1
“## VLM for Navigation and Localization | **Title** | **Date** | **Comment** | | --- | --- | --- | | **[LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://arxiv.org/abs/2605.27365v1)** | 2026-05-26 | | | **[EpiCurveBench: Evaluatin”
osa-mayor/DailyUpdate⭐ 1
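“ ], "chat_message": "🔥 Today's papers\n\n📌 Today's summary: Today's research focused on building precise evaluation benchmarks for high-performance video and audio-visual generation models and on designing efficient model architectures. In addition, research was active on data quality management and efficient inference/verification frameworks for implementing practical AI services such as mobile agents and 3D reconstruction.\n\n📍 1. LocateAnything: Fast and High-Quality Vision-Language Gro”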
shubhamshardul-work/Projects⭐ 1
“ ### **[LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](http://arxiv.org/abs/2605.27365v1)** "LocateAnything" introduces a new method for vision-language grounding and detection that uses parallel box decoding to achieve faster and high”
JPM2002/JPM2002⭐ 1
“| # | Title | Cat. | Date | Links | |:-:|:------|:----:|:----:|:------| | 1 | **[SpatialBench: Is Your Spatial Foundation Model an All-Round Player?](https://arxiv.org/pdf/2605.27367v1)** Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fang”
BraveBoBo/paper-subscription⭐ 1
“ ## arXiv cs.RO (38) ### [LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding](https://arxiv.org/abs/2605.27365v1) *Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan”
aparasion/turingwire⭐ 1
“source_publisher: "arXiv cs.AI" source_url: "https://arxiv.org/abs/2605.27365v1" arxiv_id: "2605.27365"”
Kiraaa1/ArXic-AI-Paper-Digest-Agent⭐ 1
“ ### 3. LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding **Authors:** Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu **Link:** http”