Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

#Vision-Language Models #Hallucination #Spatial Credit Redistribution #Transformer #Computer Vision #AI Research

📌 Key Takeaways

  • Researchers developed SCR, a training-free method that reduces VLM hallucinations by redistributing activation credit across visual patches
  • SCR cuts hallucination by 4.7-6.0 percentage points on the POPE-Adversarial benchmark
  • The gains come with minimal computational overhead of only 43-56 ms, lower than competing methods such as OPERA and VCD
  • An ablation confirms that attention-guided source selection is essential to the improvements

📖 Full Retelling

Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, and Md Ashikur Rahman introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that reduces hallucinations in vision-language models, in a paper submitted to arXiv on February 25, 2026. The work addresses a persistent failure mode in which vision-language models report objects that are not actually present in the input image. The authors trace this failure to "spatial credit collapse": activation credit concentrates on a few sparse visual patches in early transformer layers, which suppresses contextual evidence and pushes the model to rely on language priors instead of the image. SCR counteracts the collapse by redistributing hidden-state activation from high-attention source patches to their surrounding context, with the intervention guided by low-entropy inputs. The method yields consistent improvements across multiple model families and benchmarks while adding little computational overhead.
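The full procedure is in the paper, but the described mechanism (shift part of the activation of attention-dominant patches onto the surrounding context) lends itself to a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: the function name `redistribute_credit`, the `top_k` source count, and the mixing weight `alpha` are all invented for exposition.

```python
import torch

def redistribute_credit(hidden: torch.Tensor,
                        attn: torch.Tensor,
                        top_k: int = 8,
                        alpha: float = 0.3) -> torch.Tensor:
    """Shift a fraction of activation from high-attention 'source'
    patches onto the remaining context patches (all names hypothetical)."""
    out = hidden.clone()
    src = torch.topk(attn, k=top_k).indices             # dominant patches
    ctx = torch.ones(hidden.size(0), dtype=torch.bool)
    ctx[src] = False                                    # context = the rest
    moved = alpha * out[src]                            # credit removed from sources
    out[src] = out[src] - moved
    out[ctx] = out[ctx] + moved.sum(dim=0) / ctx.sum()  # spread over context
    return out

# Toy usage: 16 visual patches with 64-dim hidden states.
hidden = torch.randn(16, 64)
attn = torch.rand(16)
smoothed = redistribute_credit(hidden, attn, top_k=4, alpha=0.25)
```

In this reading, the intervention is a conserved reallocation: total activation mass is preserved and only its spatial distribution changes, which fits the paper's framing of "redistribution" rather than suppression.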

🏷️ Themes

Artificial Intelligence, Computer Vision, Machine Learning


Original Source

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.22469 [Submitted on 25 Feb 2026]

Title: Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Authors: Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif, Juena Ahmed Noshin, Md Ashikur Rahman

Abstract: Vision-language models frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative) and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage p...
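The abstract ties both the guidance signal and the largest gains to low-entropy inputs. One plausible reading, offered here purely as an assumption rather than the authors' criterion, is that the intervention is gated on the entropy of the attention distribution over visual patches: when attention has collapsed onto a few patches, entropy is low and redistribution fires. The threshold below is invented for illustration.

```python
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) of a possibly unnormalized attention vector."""
    p = attn / attn.sum()
    return -(p * torch.log(p.clamp_min(1e-12))).sum()

# Toy gate: a sharply peaked attention pattern counts as 'collapsed',
# so the intervention would fire. The 0.8-nat threshold is assumed.
attn = torch.tensor([0.90, 0.05, 0.03, 0.02])
apply_scr = attention_entropy(attn).item() < 0.8
print(apply_scr)  # True: entropy is ~0.43 nats vs ~1.39 for uniform attention
```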
Read full article at source

Source

arxiv.org
