Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
#Vision-Language-Action models #robotic manipulation #OptimusVLA #dual-memory framework #Global Prior Memory #Local Consistency Memory #inference efficiency #temporal consistency
📌 Key Takeaways
- Researchers developed OptimusVLA, a dual-memory framework enhancing robotic manipulation efficiency
- The model addresses critical bottlenecks in existing Vision-Language-Action systems
- OptimusVLA achieves superior performance across multiple simulation benchmarks
- Real-world tests confirm the model's effectiveness with significant speed improvements
📖 Full Retelling
Researchers led by Zaijing Li and colleagues announced OptimusVLA, a dual-memory Vision-Language-Action framework designed to enhance robotic manipulation efficiency, in a paper submitted to arXiv on February 22, 2026. The approach addresses critical limitations in hierarchical Vision-Language-Action models, which have become the dominant paradigm for robotic manipulation but are increasingly bottlenecked by their action generation process. The researchers identified two main problems with current models: low inference efficiency, caused by a distributional gap between isotropic noise priors and target action distributions, and poor robustness, stemming from policies that condition solely on the current observation while neglecting historical context.

OptimusVLA introduces two memory components to solve these issues. The Global Prior Memory (GPM) replaces traditional Gaussian noise with task-level priors retrieved from semantically similar trajectories, shortening the generative path and decreasing the number of function evaluations. The Local Consistency Memory (LCM) dynamically models executed action sequences to infer task progress and injects consistency constraints that enforce temporal coherence and trajectory smoothness.

In testing across three simulation benchmarks, OptimusVLA outperformed existing approaches, achieving a 98.6% average success rate on LIBERO, improving over previous models by 13.5% on CALVIN, and attaining a 38% average success rate on RoboTwin 2.0 Hard. Real-world evaluations further confirmed the model's capabilities: OptimusVLA ranked highest on the Generalization and Long-horizon test suites, surpassing previous models by 42.9% and 52.4% respectively, while delivering a 2.9x inference speedup.
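The two mechanisms can be illustrated with a toy numeric sketch. Everything below is an illustrative assumption, not the paper's implementation: the "memory bank", embeddings, chunk dimensions, and the simplified one-line denoiser are all made up. The point is only that starting iterative generation from a retrieved, semantically similar action chunk (the GPM idea) takes far fewer function evaluations than starting from Gaussian noise, and that a consistency penalty over the executed history (the LCM idea) can score temporal coherence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory bank of (task embedding, expert action chunk) pairs.
# OptimusVLA's GPM retrieves task-level priors from semantically similar
# trajectories; here both keys and chunks are random placeholders.
bank_keys = rng.normal(size=(32, 8))         # fake task embeddings
bank_chunks = rng.normal(size=(32, 16, 7))   # 16-step, 7-DoF action chunks

def retrieve_prior(query_emb):
    """GPM-style retrieval: nearest neighbour by cosine similarity."""
    sims = bank_keys @ query_emb / (
        np.linalg.norm(bank_keys, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return bank_chunks[np.argmax(sims)]

def denoise_steps(x0, target, lr=0.3, tol=0.05, max_steps=200):
    """Stand-in for an iterative denoiser: each loop iteration is one
    'function evaluation' moving the sample toward the target chunk."""
    x, steps = x0.copy(), 0
    while np.linalg.norm(x - target) > tol and steps < max_steps:
        x += lr * (target - x)   # one NFE
        steps += 1
    return steps

query = bank_keys[5] + 0.01 * rng.normal(size=8)  # query near bank entry 5
target = bank_chunks[5]                           # "true" action chunk

steps_gauss = denoise_steps(rng.normal(size=target.shape), target)
steps_prior = denoise_steps(retrieve_prior(query), target)
print(f"NFEs from Gaussian init: {steps_gauss}, from retrieved prior: {steps_prior}")

def consistency_penalty(history, chunk):
    """LCM-style constraint: penalise the jump from the last executed
    action plus roughness (second differences) in the new chunk."""
    traj = np.vstack([history[-1:], chunk])
    jump = traj[1] - traj[0]
    jerk = np.diff(traj, n=2, axis=0)
    return float(np.sum(jump**2) + np.sum(jerk**2))
```

A smooth continuation of the executed history scores a lower `consistency_penalty` than a discontinuous one, which is the kind of signal a learned consistency constraint could enforce at generation time.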
🏷️ Themes
Robotics, Artificial Intelligence, Computer Vision
Original Source
Computer Science > Robotics | arXiv:2602.20200 [Submitted on 22 Feb 2026]
Title: Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie
Abstract: Hierarchical Vision-Language-Action models have rapidly become a dominant paradigm for robotic manipulation, typically comprising a Vision-Language backbone for perception and understanding together with a generative policy for action generation. However, performance is increasingly bottlenecked by the action generation process. Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases denoising steps and the incidence of infeasible samples. Poor robustness: existing policies condition solely on the current observation, neglecting the constraint of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations. LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness.
Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 H...