Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
#Perceptio #vision-language models #spatial token generation #perception enhancement #multimodal AI
📌 Key Takeaways
- Perceptio augments LVLMs with explicit spatial tokens: SAM2-based semantic segmentation tokens and VQ-VAE depth tokens generated inside the autoregressive sequence.
- A VQ-VAE depth codebook, distilled from a strong monocular teacher, compresses dense depth maps into compact token sequences.
- Composite depth-token objectives (marker, token, and count losses) and a soft-merging technique stabilize depth-token generation and keep reconstruction differentiable.
- Built on InternVL, the model improves referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%.
📖 Full Retelling
arXiv:2603.18795v1 Announce Type: cross
Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth-token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
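As a rough illustration of the two depth-side ideas in the abstract, and not the paper's actual code (all function names, shapes, and the nearest-neighbor/softmax choices below are assumptions), depth tokenization against a learned VQ-VAE codebook and the soft-merging trick for differentiable reconstruction might look like:

```python
import numpy as np

def quantize_depth(depth_feats, codebook):
    """Hard tokenization: map each depth feature vector to its nearest codebook entry.

    depth_feats: (N, D) dense depth features; codebook: (K, D) learned codes.
    Returns (N,) integer token ids. The argmin is non-differentiable.
    """
    dists = ((depth_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return dists.argmin(axis=-1)

def soft_merge(logits, codebook, temperature=1.0):
    """Soft merging: blend codebook entries by token probabilities instead of
    picking a single hard code, so a reconstruction loss can backpropagate
    through the token logits.
    """
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)                   # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # (N, K)
    return probs @ codebook                                  # (N, D) soft embedding
```

The design point is that hard quantization gives compact discrete sequences for the LLM to emit, while the probability-weighted blend provides a differentiable path for a depth reconstruction objective during training.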
🏷️ Themes
AI Enhancement, Multimodal Models
Original Source
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.18795 [Submitted on 19 Mar 2026]
Title: Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
Abstract: Large Vision Language Models excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth-token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arX...
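The composite depth-token objective mentioned in the abstract (marker, token, and count losses) can be sketched as a weighted sum. The function signature, loss weights, and the choice of cross-entropy for markers/tokens and squared error for the count are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over rows of (N, K) logits against (N,) integer targets."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def composite_depth_token_loss(token_logits, token_targets,
                               marker_logits, marker_targets,
                               pred_count, target_count,
                               w_token=1.0, w_marker=0.5, w_count=0.1):
    """Hypothetical composite objective for stabilizing depth-token generation."""
    token_loss = cross_entropy(token_logits, token_targets)    # right code at each slot
    marker_loss = cross_entropy(marker_logits, marker_targets)  # depth-span boundary markers
    count_loss = (pred_count - target_count) ** 2               # emit the right token count
    return w_token * token_loss + w_marker * marker_loss + w_count * count_loss
```

Each term targets a distinct failure mode: the token loss supervises code identity, the marker loss keeps the spatial span delimited, and the count loss discourages emitting too few or too many depth tokens.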
Read full article at source