Context-Dependent Affordance Computation in Vision-Language Models
#vision-language models #affordance computation #context-dependent #AI perception #object interaction #functional understanding #robotics
📌 Key Takeaways
- Vision-language models can compute affordances based on context, not just object recognition.
- This capability allows AI to understand how objects can be used in different situations.
- The research highlights a shift from static object identification to dynamic functional understanding.
- Potential applications include robotics and assistive technologies for real-world interaction.
📖 Full Retelling
arXiv:2603.04419v1 Announce Type: cross
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent.
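To make the lexical drift metric concrete, here is a minimal sketch of how a per-scene Jaccard score across persona-primed descriptions could be computed. The tokenizer, stopword list, and example descriptions below are illustrative assumptions, not the paper's actual pipeline.

```python
from itertools import combinations

def affordance_terms(description: str) -> set[str]:
    """Crude lexical affordance set: lowercase tokens minus a small stopword list (assumed, not the paper's)."""
    stopwords = {"the", "a", "an", "to", "of", "and", "or", "for", "with", "is", "are"}
    return {tok for tok in description.lower().split() if tok.isalpha() and tok not in stopwords}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two lexical sets: 1.0 = identical vocabulary, 0.0 = disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def lexical_drift(descriptions_by_persona: dict[str, str]) -> float:
    """Mean pairwise Jaccard similarity across context conditions for one scene.
    Low values (the paper reports ~0.095 on average) indicate strong context-dependence."""
    term_sets = [affordance_terms(d) for d in descriptions_by_persona.values()]
    pairs = list(combinations(term_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: one scene described under two of the seven personas.
example = {
    "chef": "the knife affords slicing and the counter affords food preparation",
    "child": "the counter is too high to reach and the knife should not be touched",
}
print(f"mean pairwise Jaccard: {lexical_drift(example):.3f}")
```

Averaging such per-scene scores over all scene-context pairs would yield the kind of aggregate lexical-overlap figure the abstract reports; the semantic-level number instead compares sentence embeddings with cosine similarity.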
🏷️ Themes
AI Perception, Contextual Understanding
Original Source
Computer Science > Computation and Language
arXiv:2603.04419 [Submitted on 14 Feb 2026]
Title: Context-Dependent Affordance Computation in Vision-Language Models
Authors: Murad Farzulla
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
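The stochastic-baseline argument (within-prime variance should be much smaller than cross-prime variance if drift is a genuine context effect) can be illustrated with a small numerical sketch. The array shapes and synthetic embeddings below are assumptions for demonstration, not the paper's 2,384 actual inference runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sentence-embedding outputs: 5 seeds x 3 context primes x 8 embedding dims.
# In the study the embeddings would come from repeated VLM inference; here they are synthetic.
n_seeds, n_primes, dim = 5, 3, 8
prime_centroids = rng.normal(scale=1.0, size=(n_primes, dim))        # distinct "context" centroids
embeddings = prime_centroids[None, :, :] + rng.normal(scale=0.1, size=(n_seeds, n_primes, dim))

# Within-prime variance: spread across seeds/temperatures for the SAME context prime.
within_prime = embeddings.var(axis=0).mean()

# Cross-prime variance: spread across context primes after averaging out sampling noise.
cross_prime = embeddings.mean(axis=0).var(axis=0).mean()

print(f"within-prime variance: {within_prime:.4f}")
print(f"cross-prime variance:  {cross_prime:.4f}")
# If cross-prime variance dominates, the drift tracks the context manipulation rather than generation noise.
```

The same tensor layout (scenes x primes x embedding dimensions) is also what a Tucker decomposition would operate on to extract the context-specific latent factors the abstract describes.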