MALLVI: a multi-agent framework for integrated, generalized robotic manipulation
#Multi‑Agent #Large Language Model #Vision‑Language Model #Robot Manipulation #Closed‑Loop Feedback #Zero‑Shot Learning #Decomposer #Localizer #Thinker #Reflector #Descriptor #Task Planning #Perception #Reasoning #Generalization #Error Recovery
📌 Key Takeaways
- MALLVI is a modular multi‑agent architecture that includes Decomposer, Localizer, Thinker, Reflector, and optionally a Descriptor agent for visual memory.
- The framework takes a natural‑language instruction and an environment image to produce atomic robot actions that can be executed by a manipulator.
- After each action, a Vision‑Language Model evaluates the environment and decides whether to repeat or proceed, enabling closed‑loop control.
- The Reflector agent enables targeted error detection and recovery by reactivating only relevant subsystems, avoiding a full system restart.
- Experimental results in simulation and real‑world setups show that iterative multi‑agent coordination improves generalization and increases success rates in zero‑shot manipulation scenarios.
- MALLVI operates without specialized model fine‑tuning or prompt tuning, relying instead on coordinated reasoning among distinct agents.
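The agent roles listed above can be sketched as a simple pipeline: a Decomposer splits the instruction into atomic subtasks, a Localizer grounds each subtask in the scene, and a Thinker emits an executable action. This is a minimal illustrative sketch only; the class names, method signatures, and string-based "decomposition" are assumptions standing in for the LLM/VLM calls the paper describes, not the authors' actual code.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str

class Decomposer:
    """Splits a natural-language instruction into ordered atomic subtasks."""
    def decompose(self, instruction: str) -> list[Subtask]:
        # Stand-in for an LLM call that returns ordered atomic steps.
        return [Subtask(s.strip()) for s in instruction.split(", then ")]

class Localizer:
    """Grounds a subtask in the scene (here a stub dict of object poses)."""
    def localize(self, subtask: Subtask, scene: dict) -> tuple:
        # Stand-in for a VLM grounding the object named in the subtask.
        return scene.get(subtask.description.split()[-1], (0.0, 0.0, 0.0))

class Thinker:
    """Turns a grounded subtask into an atomic robot action."""
    def plan(self, subtask: Subtask, target: tuple) -> dict:
        return {"action": subtask.description.split()[0], "target": target}

def run_pipeline(instruction: str, scene: dict) -> list[dict]:
    decomposer, localizer, thinker = Decomposer(), Localizer(), Thinker()
    actions = []
    for sub in decomposer.decompose(instruction):
        target = localizer.localize(sub, scene)
        actions.append(thinker.plan(sub, target))
    return actions

# Toy scene: object name -> (x, y, z) position.
scene = {"cube": (0.4, 0.1, 0.02), "tray": (0.6, -0.2, 0.05)}
plan = run_pipeline("pick cube, then place tray", scene)
```

The key design point the sketch illustrates is separation of concerns: each agent exposes one narrow interface, which is what later lets the Reflector re-run a single agent instead of restarting the whole pipeline.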
📖 Full Retelling
🏷️ Themes
Robotics Manipulation, Multi‑Agent Systems, Large Language Models, Closed‑Loop Control, Perception and Vision‑Language Integration, Task Planning and Reasoning, Zero‑Shot Generalization
Deep Analysis
Why It Matters
MALLVI introduces a closed-loop multi-agent system that improves robotic manipulation by integrating language, vision, and planning. This approach reduces failure rates in dynamic environments and enables zero-shot task execution.
Context & Background
- Robotic manipulation traditionally relies on specialized models or open-loop control
- Large language models have been used for task planning but lack environmental feedback
- MALLVI coordinates separate agents for perception, reasoning, and action
- Simulation and real-world tests show higher task success rates for the closed-loop, multi-agent design
- The framework supports visual memory via a Descriptor agent
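The closed-loop behavior described above (execute an action, let a VLM judge the new scene, and on failure have the Reflector reactivate only the agent it blames) can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `vlm_verify`, the Reflector's blame policy, and the simulated first-attempt failure are illustrative, not the paper's API.

```python
class Reflector:
    """Decides which subsystem to re-run after a failed verification."""
    def diagnose(self, action: dict, attempt: int) -> str:
        # Toy policy: suspect grounding first, then planning.
        return "localizer" if attempt == 0 else "thinker"

def vlm_verify(action: dict, world: set) -> bool:
    # Stand-in for a Vision-Language Model judging the post-action image.
    return action["target"] in world

def execute(action: dict, world: set, succeed: bool) -> None:
    # Stand-in for the manipulator; on success the target ends up placed.
    if succeed:
        world.add(action["target"])

def closed_loop(actions: list, world: set, max_retries: int = 2) -> list:
    reflector = Reflector()
    trace = []
    for action in actions:
        for attempt in range(max_retries + 1):
            # Simulate a first-attempt failure to exercise recovery.
            execute(action, world, succeed=(attempt > 0))
            if vlm_verify(action, world):
                trace.append((action["target"], "done", attempt))
                break
            # Reactivate only the agent the Reflector blames,
            # instead of restarting the whole pipeline.
            trace.append((action["target"], reflector.diagnose(action, attempt), attempt))
    return trace

trace = closed_loop([{"name": "pick", "target": "cube"}], world=set())
```

Note how the retry loop wraps a single action, not the whole plan: that locality is what makes the Reflector's targeted recovery cheaper than a full system restart.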
What Happens Next
Future work will focus on scaling MALLVI to multi-robot teams and integrating more advanced vision-language models. Wider adoption could standardize closed-loop manipulation pipelines in industry.
Frequently Asked Questions
How does MALLVI differ from a single monolithic model?
It uses specialized agents that communicate, allowing targeted error recovery and efficient resource use.
Is MALLVI ready for large-scale deployment?
Current experiments cover simulation and small-scale real-world tests; larger deployments will require further validation.
Can MALLVI handle tasks it was not trained on?
It is designed for zero-shot manipulation: given a natural-language instruction and visual input, it can perform new tasks without retraining.