SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
#SPARC #Vision-Language Models #Test-time scaling #Chain-of-thought #Perception error #Reasoning circuits #arXiv #Deep learning
📌 Key Takeaways
- Test-time scaling in current Vision-Language Models remains brittle because their chains-of-thought are unstructured.
- The SPARC framework introduces a formal separation between perception and reasoning to prevent error cascades (a minimal illustrative sketch follows this list).
- Traditional VLM training often requires expensive reinforcement learning with labor-intensive hand-crafted rewards.
- Entangling visual perception with logical reasoning produces long, disorganized contexts in which small early perceptual mistakes can cascade into a completely wrong final answer.
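Since this digest only summarizes the paper, the sketch below is an assumption about what separating perception from reasoning could look like at inference time: a perception pass extracts short, checkable observations from the image, and a reasoning pass sees only those observations. The `TextGenerator`, `perceive`, and `reason` names are hypothetical stand-ins, not SPARC's actual interface.

```python
from dataclasses import dataclass
from typing import Protocol


class TextGenerator(Protocol):
    """Stand-in for any model client exposing a generate() call."""
    def generate(self, prompt: str, image=None, max_tokens: int = 256) -> str: ...


@dataclass
class Observation:
    """One short, checkable visual fact produced by the perception stage."""
    description: str


def perceive(image, vlm: TextGenerator, max_observations: int = 8) -> list[Observation]:
    """Perception stage: ask only about what is visible, one fact at a time,
    so each observation stays short and independently verifiable."""
    observations: list[Observation] = []
    for i in range(max_observations):
        prompt = f"State visible fact #{i + 1} about the image, or reply DONE."
        answer = vlm.generate(prompt=prompt, image=image, max_tokens=32).strip()
        if answer.upper().startswith("DONE"):
            break
        observations.append(Observation(description=answer))
    return observations


def reason(question: str, observations: list[Observation],
           llm: TextGenerator, budget: int = 512) -> str:
    """Reasoning stage: the model sees only the extracted facts, not the image,
    so a single perceptual slip stays localized instead of cascading through
    one long, entangled chain-of-thought."""
    facts = "\n".join(f"- {o.description}" for o in observations)
    prompt = (f"Facts observed in the image:\n{facts}\n\n"
              f"Question: {question}\nAnswer step by step using only these facts.")
    return llm.generate(prompt=prompt, max_tokens=budget)
```

Keeping the reasoning stage blind to the raw image is what limits the blast radius of any single perceptual mistake in this sketch.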
🏷️ Themes
Artificial Intelligence, Computer Vision, Machine Learning
📚 Related People & Topics
Deep learning
Branch of machine learning
In machine learning, deep learning focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and revolves around stacking artificial neurons into layers and "training" t...
SPARC
RISC instruction set architecture (shares only its acronym with this paper's SPARC framework)
SPARC (Scalable Processor ARChitecture) is a reduced instruction set computer (RISC) instruction set architecture originally developed by Sun Microsystems. Its design was strongly influenced by the experimental Berkeley RISC system developed in the early 1980s. First developed in 1986 and released i...
🔗 Entity Intersection Graph
Connections for Deep learning:
- 🌐 Neural network (4 shared articles)
- 🌐 Medical imaging (2 shared articles)
- 🌐 MLP (2 shared articles)
- 🌐 CSI (1 shared article)
- 🌐 Generative adversarial network (1 shared article)
- 🌐 Pipeline (computing) (1 shared article)
- 🌐 Magnetic flux leakage (1 shared article)
- 🌐 Computer vision (1 shared article)
- 🌐 Hardware acceleration (1 shared article)
- 🌐 Diagnosis (1 shared article)
- 🌐 Explainable artificial intelligence (1 shared article)
- 🌐 Attention (machine learning) (1 shared article)
📄 Original Source Content
arXiv:2602.06566v1 Announce Type: cross
Abstract: Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to a
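As a reading aid, the abstract's definition of test-time scaling (dynamically expanding the token budget during inference as needed) can be pictured as a small control loop. The `model.generate` call below, which returns the generated text plus a termination flag, is an assumed stand-in rather than an API from the paper.

```python
def answer_with_scaling(question: str, image, model,
                        budgets=(128, 256, 512, 1024)) -> str:
    """Retry with progressively larger token budgets until the model's answer
    terminates on its own. `model.generate` returning (text, finished) is an
    assumed stand-in interface, not something defined by the paper."""
    text = ""
    for budget in budgets:
        text, finished = model.generate(prompt=question, image=image,
                                        max_tokens=budget)
        if finished:
            break
    return text
```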