An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
#Human-Robot Interaction #Multimodal AI #Vision-Language Models #Large Language Models #Fuzzy Logic #Dobot Magician #Florence-2 #Llama 3.1
📌 Key Takeaways
- Researchers developed a multimodal HRI framework that combines vision-language models, speech processing, and fuzzy logic to control a Dobot Magician robotic arm
- The system integrates Florence-2, Llama 3.1, and Whisper for object detection, language understanding, and speech recognition
- Experimental tests showed 75% command execution accuracy on consumer-grade hardware
- The architecture provides a flexible foundation for future HRI research
- The approach enables more natural and intuitive human-robot collaboration
📖 Full Retelling
In a paper submitted on February 23, 2026, researchers Guanting Shen and Zi Tian introduced a multimodal human-robot interaction framework that combines vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm, addressing the central challenge of accurately interpreting human intent in human-machine collaboration. The system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, giving users a seamless interface for manipulating objects through spoken commands. By jointly addressing scene perception and action planning, the approach improves the reliability of command interpretation and execution. Experimental evaluations on consumer-grade hardware showed a command execution accuracy of 75%, highlighting both the robustness and the adaptability of the system. Beyond its current performance, the architecture serves as a flexible and extensible foundation for future human-robot interaction research, offering a practical pathway toward more natural collaboration through tightly coupled speech and vision-language processing.
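The paper does not publish source code, but the described pipeline is straightforward to sketch: speech is transcribed, a language model extracts a structured intent, a vision-language model detects candidate objects, and fuzzy logic arbitrates which detection to act on. The Python sketch below is a minimal, hypothetical illustration of that flow under stated assumptions: the stub functions (`transcribe`, `parse_intent`, `detect_objects`), the triangular membership functions, the min t-norm rule, and the 0.5 execution threshold are all stand-ins for Whisper, Llama 3.1, Florence-2, and the paper's unpublished fuzzy rules, not the authors' actual implementation.

```python
# Hypothetical end-to-end sketch of a speech -> language -> vision -> fuzzy-logic
# pipeline like the one described in the paper. All stage APIs and rule shapes
# below are illustrative assumptions, not the authors' code.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float   # detector score in [0, 1]
    cx: float           # box centre, normalised image x in [0, 1]
    cy: float           # box centre, normalised image y in [0, 1]

def transcribe(audio_path: str) -> str:
    """Stand-in for Whisper speech recognition."""
    return "pick up the red cube on the left"

def parse_intent(utterance: str) -> dict:
    """Stand-in for Llama 3.1 prompted to emit a structured action."""
    return {"action": "pick", "object": "red cube", "region": "left"}

def detect_objects(frame) -> list[Detection]:
    """Stand-in for Florence-2 open-vocabulary detection on a camera frame."""
    return [Detection("red cube", 0.91, 0.22, 0.55),
            Detection("blue ball", 0.84, 0.71, 0.40)]

def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular fuzzy membership: rises over a->b, falls over b->c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_target_score(d: Detection, intent: dict) -> float:
    """Blend detector confidence with a fuzzy spatial term using a min t-norm.
    Breakpoints and the rule form are assumptions, not the paper's rule base."""
    conf = tri(d.confidence, 0.3, 1.0, 1.7)                 # 'confident' ramps up from 0.3
    left = tri(d.cx, -0.5, 0.0, 0.5) if intent.get("region") == "left" else 1.0
    match = 1.0 if intent["object"] in d.label else 0.0     # lexical object match
    return min(match, conf, left)                           # rule: match AND confident AND left

def run_pipeline(audio_path: str, frame) -> None:
    intent = parse_intent(transcribe(audio_path))
    candidates = detect_objects(frame)
    best = max(candidates, key=lambda d: fuzzy_target_score(d, intent))
    if fuzzy_target_score(best, intent) > 0.5:              # assumed execution threshold
        print(f"{intent['action']} -> {best.label} at ({best.cx:.2f}, {best.cy:.2f})")
        # A real system would map (cx, cy) to arm coordinates and issue a
        # Dobot Magician move/grip command here.
    else:
        print("Low fuzzy score: ask the user to clarify the command.")

run_pipeline("command.wav", frame=None)
```

The fuzzy layer is the interesting design choice in such a pipeline: rather than hard-thresholding the detector, graded memberships let a marginal detection in the right region outrank a confident detection in the wrong one, and a low overall score becomes a natural trigger for asking the user to rephrase instead of executing a wrong action.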
🏷️ Themes
Human-Robot Interaction, Multimodal AI Systems, Natural Language Processing, Robotics Technology
Original Source
--> Computer Science > Robotics, arXiv:2602.20219 [Submitted on 23 Feb 2026]
Title: An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
Authors: Guanting Shen, Zi Tian
Abstract: Interpreting human intent accurately is a central challenge in human-robot interaction and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for future HRI research, offering a practical pathway toward more sophisticated and natural human-robot collaboration through tightly coupled speech and vision-language processing.
Comments: Preprint currently under revision
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20219 [cs.RO] (or arXiv:2602.20219v1 [cs.RO] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20219
Submission history: [v1] Mon, 23 Feb 2026, from Guanting Shen
Read full article at source