MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
#Theory-of-Mind #Vision-Language Models #Embodied Agents #MindPower #Mental Reasoning #CVPR 2026 #GPT-4o #Multimodal AI
📌 Key Takeaways
- MindPower enables Theory-of-Mind reasoning in vision-language embodied agents
- The framework integrates Perception, Mental Reasoning, Decision Making, and Action components
- Researchers introduced Mind-Reward optimization for consistent mental reasoning and behavior
- MindPower outperforms GPT-4o by over 12% in decision making and action generation
📖 Full Retelling
Ruoxuan Zhang and nine co-authors have developed MindPower, a framework that enables Theory-of-Mind (ToM) reasoning in vision-language embodied agents, addressing a key limitation in current AI systems: the inability to model the mental states of others. The paper, submitted to arXiv on November 28, 2025 and revised on February 24, 2026, has been accepted for presentation at CVPR 2026. The team introduces this Robot-Centric framework to close a gap in existing benchmarks, which focus solely on human mental states while ignoring the agent's own perspective, a blind spot that hinders coherent decision-making and action generation.
MindPower represents a comprehensive approach to creating more sophisticated embodied agents by integrating four key components: Perception, Mental Reasoning, Decision Making, and Action. When processing multimodal inputs, the system first perceives both the environment and human states, then performs Theory-of-Mind reasoning to model both self and others, and finally generates decisions and actions guided by these inferred mental states. This holistic approach enables AI systems to consider not only what humans are doing but also what they might be thinking or intending to do, creating more natural and contextually appropriate interactions.
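The four-stage loop described above can be sketched in a few lines of Python. This is a hypothetical illustration only: every class and function name below (`MentalState`, `perceive`, `mental_reasoning`, `decide`, `act`, `mindpower_step`) is invented for this sketch and does not come from the paper, and the stubbed logic stands in for what would really be VLM inference.

```python
from dataclasses import dataclass, field

@dataclass
class MentalState:
    """Illustrative belief-desire-intention container for self or other."""
    beliefs: list = field(default_factory=list)
    desires: list = field(default_factory=list)
    intentions: list = field(default_factory=list)

def perceive(observation):
    """Stage 1 (Perception): extract environment and human-state features."""
    return {"environment": observation.get("scene"),
            "human": observation.get("human_pose")}

def mental_reasoning(percept):
    """Stage 2 (Mental Reasoning): model both the robot's own state
    and the human's inferred state (stubbed)."""
    self_state = MentalState(intentions=["assist"])
    other_state = MentalState(beliefs=[f"human sees {percept['environment']}"])
    return self_state, other_state

def decide(self_state, other_state):
    """Stage 3 (Decision Making): choose conditioned on inferred states."""
    return "offer_help" if "assist" in self_state.intentions else "wait"

def act(decision):
    """Stage 4 (Action): ground the decision into an executable command."""
    return {"offer_help": "move_to(human); speak('Can I help?')",
            "wait": "idle()"}[decision]

def mindpower_step(observation):
    # Perception -> Mental Reasoning -> Decision Making -> Action
    percept = perceive(observation)
    self_state, other_state = mental_reasoning(percept)
    decision = decide(self_state, other_state)
    return act(decision)
```

The point of the sketch is the data flow: decisions are conditioned on explicit self and other mental-state objects rather than on raw perception alone.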
The researchers also introduced Mind-Reward, a novel optimization objective that encourages Vision-Language Models (VLMs) to produce consistent Theory-of-Mind reasoning and behavior. This innovation addresses the challenge of aligning internal mental models with external actions, a critical aspect of creating believable and effective AI agents. In performance evaluations, the MindPower framework demonstrated significant improvements, outperforming even advanced models like GPT-4o by 12.77% in decision making and 12.49% in action generation tasks. These results underscore the potential of Theory-of-Mind capabilities to revolutionize how AI systems interact with humans in complex environments, paving the way for more intuitive and context-aware robotic assistants and virtual agents.
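One way to picture a consistency-encouraging objective like Mind-Reward is a task reward augmented with a bonus when executed actions agree with the agent's own stated intentions. The function below is a loose, hypothetical sketch in that spirit; the paper's actual objective is not reproduced here, and the name `mind_reward`, the trace format, and the weight `alpha` are all assumptions of this illustration.

```python
def mind_reward(trace, task_success, alpha=0.5):
    """Score a rollout for reasoning/behavior consistency.

    trace: list of (stated_intention, executed_action) string pairs
    task_success: base task reward for the rollout
    alpha: weight of the consistency bonus (assumed hyperparameter)
    """
    if not trace:
        return task_success
    # Fraction of steps where the action text reflects the stated intention.
    consistent = sum(1.0 for intent, action in trace if intent in action)
    return task_success + alpha * consistent / len(trace)
```

Under this toy scoring, a rollout whose actions match its stated intentions earns a higher reward than one with the same task outcome but mismatched reasoning, which is the alignment pressure the paragraph above describes.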
🏷️ Themes
Artificial Intelligence, Human-Computer Interaction, Cognitive Computing
📚 Related People & Topics
Multimodal learning
Machine learning methods using multiple input modalities
Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question...
Entity Intersection Graph — connections for Multimodal learning: CLIP (2 shared), TabPFN (1 shared), Machine learning (1 shared), Reinforcement learning (1 shared), Computer vision (1 shared)
Original Source
Computer Science > Artificial Intelligence
arXiv:2511.23055 [cs.AI] (v1 submitted 28 Nov 2025; v2 last revised 24 Feb 2026)
Title: MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Authors: Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng
Abstract: Theory of Mind refers to the ability to infer others' mental states, such as beliefs, desires, and intentions. Current vision-language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent's own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.
Comments: Accepted by CVPR 2026
DOI: https://doi.org/10.48550/arXiv.2511.23055
Submission history: [v1] Fri, 28 Nov 2025 10:24:44 UTC (20,668 KB); [v2] Tue, 24 Feb 2026 00:57:43 UTC (20,668 KB)