How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
#Vision-Language Models #Embodied Agents #NativeEmbodied #Artificial Intelligence #Benchmark #Foundational Skills #Low-level Action Space #Real-world Control
📌 Key Takeaways
- Researchers introduced NativeEmbodied, a new benchmark for VLM-driven embodied agents
- Existing benchmarks fail to accurately assess performance in real-world control scenarios
- The benchmark includes both high-level tasks and low-level tasks for comprehensive evaluation
- Experiments revealed deficiencies in fundamental embodied skills that limit overall performance
📖 Full Retelling
Researchers led by Bo Peng and nine collaborators introduced NativeEmbodied, a new benchmark for vision-language model (VLM)-driven embodied agents, on February 24, 2026. The benchmark targets a key limitation of current evaluation methods: existing benchmarks for VLM-driven agents typically rely on high-level commands or discretized action spaces, non-native settings that differ markedly from real-world control, and they focus primarily on high-level tasks without jointly evaluating low-level and high-level performance.

NativeEmbodied instead uses a unified, native low-level action space that more closely mirrors real-world control than high-level commands or discretized actions do. Built on diverse simulated scenes, it includes three representative high-level tasks in complex scenarios to evaluate overall performance. For finer-grained analysis, the researchers decoupled the skills required by these complex tasks and constructed four types of low-level tasks, each targeting a fundamental embodied skill, so that agents can be assessed jointly across task and skill granularities. Experiments with state-of-the-art VLMs revealed clear deficiencies in several of these fundamental skills, and further analysis showed that those bottlenecks significantly limit performance on the high-level tasks.
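To make the distinction concrete, here is a minimal Python sketch contrasting a discretized high-level command interface with a unified, native low-level action space in which the agent emits continuous control values at every step. All names, fields, and limits below are hypothetical illustrations; the paper does not publish this interface.

```python
from dataclasses import dataclass
from typing import List

# Non-native setting used by many existing benchmarks: the agent picks from a
# small set of discrete, high-level commands, and the simulator expands each
# command into motion on the agent's behalf.
HIGH_LEVEL_COMMANDS: List[str] = [
    "move_forward", "turn_left", "turn_right", "pick_up", "open_door",
]

# Hypothetical sketch of a native low-level setting: the agent itself outputs
# continuous control values at every timestep.
@dataclass
class LowLevelAction:
    forward_velocity: float   # m/s, continuous
    yaw_rate: float           # rad/s, continuous
    gripper_delta: float      # change in gripper opening, continuous

def clamp(x: float, lo: float, hi: float) -> float:
    """Keep a continuous control value inside the actuator's physical limits."""
    return max(lo, min(hi, x))

def apply_limits(a: LowLevelAction) -> LowLevelAction:
    # Hypothetical actuator limits; a real benchmark would define its own.
    return LowLevelAction(
        forward_velocity=clamp(a.forward_velocity, -1.0, 1.0),
        yaw_rate=clamp(a.yaw_rate, -1.5, 1.5),
        gripper_delta=clamp(a.gripper_delta, -0.05, 0.05),
    )

# "Turn slightly left while slowing down" has no single discrete command;
# in the native space it is just one more continuous action.
step = apply_limits(LowLevelAction(forward_velocity=0.2, yaw_rate=0.6, gripper_delta=0.0))
print(step)
```

In the discretized setting the simulator handles the motion, so weaknesses in basic control can stay hidden; the low-level skill tasks in NativeEmbodied are designed to expose exactly such weaknesses, which the authors report as the bottleneck for the high-level tasks.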
🏷️ Themes
Artificial Intelligence, Benchmark Development, Embodied Intelligence
📚 Related People & Topics
Artificial intelligence
Original Source
Computer Science > Artificial Intelligence
arXiv:2602.20687 [cs.AI] (Submitted on 24 Feb 2026)
Title: How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
Authors: Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu
Abstract: Recent advances in vision-language models have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20687 [cs.AI] (or arXiv:2602.20687v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20687