PyVision-RL: Forging Open Agentic Vision Models via RL

#PyVision-RL #Reinforcement Learning #Agentic Models #Multimodal AI #Interaction Collapse #Computer Vision #Video Understanding #Open-weight Models

📌 Key Takeaways

  • PyVision-RL prevents interaction collapse in multimodal reinforcement learning systems
  • The framework combines oversampling-filtering-ranking with accumulative tool rewards
  • Researchers developed PyVision-Image and PyVision-Video models for different modalities
  • PyVision-Video uses on-demand context construction for efficient frame selection
  • Sustained interaction and on-demand processing are critical for scalable multimodal agents
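The on-demand idea in the takeaways can be illustrated with a small sketch. It contrasts uniform sampling over a whole video with sampling only inside a time window that the agent asks for. Everything here is an assumption made for illustration: the function names, the window-based request, and the per-frame token cost are hypothetical and are not taken from the paper.

```python
from typing import List

# Hypothetical cost of one frame in visual tokens (illustrative only).
TOKENS_PER_FRAME = 256


def uniform_frames(num_frames: int, budget: int) -> List[int]:
    """Baseline: spread `budget` frame indices evenly over the whole video."""
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]


def on_demand_frames(fps: float, start_s: float, end_s: float,
                     budget: int, num_frames: int) -> List[int]:
    """Sample up to `budget` frame indices inside the requested window only."""
    first = max(0, int(start_s * fps))
    last = min(num_frames - 1, int(end_s * fps))
    span = last - first + 1
    if span <= 0:
        return []  # empty or inverted window: nothing to load
    count = min(budget, span)
    step = span / count
    return [first + int(i * step) for i in range(count)]


if __name__ == "__main__":
    # A 2-minute clip at 25 fps; the agent asks for seconds 40 to 44.
    whole = uniform_frames(num_frames=3000, budget=64)
    window = on_demand_frames(fps=25.0, start_s=40.0, end_s=44.0,
                              budget=8, num_frames=3000)
    print("uniform tokens:", len(whole) * TOKENS_PER_FRAME)
    print("on-demand tokens:", len(window) * TOKENS_PER_FRAME)
```

In this toy setup the saving comes purely from loading fewer frames per step. How PyVision-Video actually decides which frames are task-relevant is not described in the abstract.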

📖 Full Retelling

A team of seven researchers led by Shitian Zhao posted PyVision-RL, a reinforcement learning framework for open-weight multimodal models, to the arXiv preprint server on February 24, 2026. The work targets interaction collapse, a failure mode in which agentic models learn during RL training to call tools less often and to reason over fewer turns, which limits the benefits of agentic behavior.

To stabilize training and sustain interaction, PyVision-RL combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward that encourages multi-turn tool use.

Using a unified training pipeline, the team trained two models: PyVision-Image for image understanding and PyVision-Video for video understanding. PyVision-Video uses on-demand context construction: it selectively samples task-relevant frames during reasoning, which significantly reduces visual token usage.

According to the abstract, experiments show strong performance and improved efficiency. The authors take this as evidence that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
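A minimal sketch of how an oversample, filter, rank loop and an accumulative tool reward could fit together is shown below. The abstract does not give the actual reward formula, filter rules, or ranking criterion, so every name and constant here (`Rollout`, `per_call_bonus`, `max_calls`, the truncation filter, ranking by reward) is a hypothetical choice for illustration rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory for a prompt (all fields are illustrative)."""
    answer_correct: bool      # did the final answer match the reference?
    tool_calls: int           # number of tool invocations in the trajectory
    truncated: bool = False   # e.g. the rollout hit a turn or token limit


def accumulative_reward(r: Rollout, per_call_bonus: float = 0.1,
                        max_calls: int = 5) -> float:
    """Accuracy reward plus a bonus that grows with each tool call.

    Assumption: the bonus is granted only for correct answers and is capped,
    so tool use alone can never outweigh correctness.
    """
    if not r.answer_correct:
        return 0.0
    return 1.0 + per_call_bonus * min(r.tool_calls, max_calls)


def select_rollouts(candidates: List[Rollout], keep: int) -> List[Rollout]:
    """Oversample -> filter -> rank: return the `keep` best valid rollouts."""
    # Filter: drop trajectories that were cut off (assumed filter rule).
    valid = [r for r in candidates if not r.truncated]
    # Rank: highest reward first (assumed ranking criterion).
    ranked = sorted(valid, key=accumulative_reward, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    # Oversample: more candidates than the training batch will keep.
    pool = [
        Rollout(answer_correct=True, tool_calls=3),
        Rollout(answer_correct=True, tool_calls=0),
        Rollout(answer_correct=False, tool_calls=4),
        Rollout(answer_correct=True, tool_calls=9, truncated=True),
    ]
    for r in select_rollouts(pool, keep=2):
        print(r, accumulative_reward(r))
```

The point of the sketch is the incentive structure: a policy that stops calling tools earns less than one that keeps interacting and still answers correctly, which is the behavior the framework is said to encourage.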

🏷️ Themes

Artificial Intelligence, Reinforcement Learning, Computer Vision, Multimodal Models

📚 Related People & Topics

Reinforcement learning

Field of machine learning

In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learnin...

Multimodal learning

Machine learning methods using multiple input modalities

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question...

Computer vision

Computerized information extraction from images

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. "Understanding" in this context signifies th...

Original Source
arXiv:2602.20739 [Submitted on 24 Feb 2026]

Title: PyVision-RL: Forging Open Agentic Vision Models via RL

Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Comments: preprint
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2602.20739 [cs.AI] (or arXiv:2602.20739v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20739 (arXiv-issued DOI via DataCite, pending registration)
Submission history: From Shitian Zhao. [v1] Tue, 24 Feb 2026 10:08:33 UTC (26,520 KB)
Source

arxiv.org
