Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation


#Perceptio #vision-language models #spatial token generation #perception enhancement #multimodal AI

📌 Key Takeaways

  • Perceptio augments a large vision-language model with 2D and 3D spatial reasoning by generating spatial tokens directly within the autoregressive sequence.
  • SAM2-based semantic segmentation tokens and VQ-VAE depth tokens are emitted before the textual answer, forming an explicit spatial chain-of-thought.
  • A VQ-VAE depth codebook is distilled from a strong monocular teacher; composite depth-token losses and a soft-merging technique stabilize depth token generation.
  • Built on InternVL, Perceptio improves referring expression segmentation on RefCOCO/+/g, HardBLINK spatial understanding accuracy (+10.3%), and MMBench accuracy (+1.0%).

📖 Full Retelling

arXiv:2603.18795v1 Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
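The "emit spatial tokens, then answer" ordering the abstract describes can be sketched as a simple sequence-construction helper. This is a minimal illustration of the output layout only, not Perceptio's implementation; the delimiter token names and the function are hypothetical.

```python
# Hypothetical delimiter tokens marking the spatial "chain-of-thought" spans
# that precede the natural-language answer in the autoregressive output.
SEG_START, SEG_END = "<seg>", "</seg>"
DEPTH_START, DEPTH_END = "<depth>", "</depth>"

def build_output_sequence(seg_tokens, depth_tokens, answer_tokens):
    """Order the model's output: segmentation tokens, then depth tokens,
    then the textual answer -- spatial interpretation before answering."""
    return (
        [SEG_START] + list(seg_tokens) + [SEG_END]
        + [DEPTH_START] + list(depth_tokens) + [DEPTH_END]
        + list(answer_tokens)
    )

seq = build_output_sequence(["s12", "s7"], ["d3", "d41"],
                            ["The", "mug", "is", "closer."])
```

The point of the ordering is that the answer tokens are conditioned on the already-emitted spatial tokens, so the model commits to an explicit geometric interpretation before it responds.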

🏷️ Themes

AI Enhancement, Multimodal Models

Entity Intersection Graph

No entity connections available yet for this article.

Original Source
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.18795 [Submitted on 19 Mar 2026]
Title: Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
Abstract: Large Vision Language Models excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.18795
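The abstract's depth tokenization and soft-merging can be sketched as vector quantization against a learned codebook: a hard nearest-neighbour lookup yields discrete depth tokens, while a softmax-weighted ("soft-merged") mix of codebook entries gives a differentiable reconstruction. A minimal NumPy sketch under assumed shapes; the function names and temperature parameter are illustrative, not the paper's API.

```python
import numpy as np

def depth_to_tokens(feats, codebook):
    """Hard VQ: nearest codebook index for each depth patch feature.
    feats: (N, D) patch features; codebook: (K, D) learned entries."""
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return d2.argmin(axis=1)  # discrete depth tokens in [0, K)

def soft_merge(feats, codebook, tau=0.1):
    """Soft-merging: softmax over negative squared distances mixes codebook
    entries, so the reconstruction stays differentiable with respect to
    both features and codebook (the argmin above is not)."""
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / tau)  # stabilized
    w /= w.sum(axis=1, keepdims=True)  # (N, K) soft assignments
    return w @ codebook  # (N, D) reconstructed depth features
```

As tau approaches zero the soft assignment collapses onto the nearest codebook entry, so the differentiable reconstruction agrees with the hard tokens while still passing gradients during training.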

Source

arxiv.org
