BravenNow
PyVision-RL: Forging Open Agentic Vision Models via RL
USA | technology | ✓ Verified - arxiv.org


#PyVision-RL #Reinforcement Learning #Agentic Models #Multimodal AI #Interaction Collapse #Computer Vision #Video Understanding #Open-weight Models

📌 Key Takeaways

  • PyVision-RL prevents interaction collapse in multimodal reinforcement learning systems
  • The framework combines oversampling-filtering-ranking with accumulative tool rewards
  • Researchers developed PyVision-Image and PyVision-Video models for different modalities
  • PyVision-Video uses on-demand context construction for efficient frame selection
  • Sustained interaction and on-demand processing are critical for scalable multimodal agents
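The source names the oversampling-filtering-ranking strategy and the accumulative tool reward but does not spell out either. The sketch below is one plausible reading, not the authors' implementation: the `Rollout` fields, the capped per-call bonus, the rule that only correct answers earn the bonus, and ranking by tool-call count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory. All fields are hypothetical."""
    answer_correct: bool   # did the final answer match the reference?
    tool_calls: int        # tool invocations summed over all turns
    broken: bool = False   # e.g. truncated output or failed execution


def accumulative_tool_reward(r: Rollout, per_call_bonus: float = 0.1,
                             max_bonus: float = 0.5) -> float:
    """Correctness reward plus a bonus that accumulates with each tool call.

    Assumption: the bonus applies only to correct answers and is capped,
    so calling tools without solving the task earns nothing.
    """
    if not r.answer_correct:
        return 0.0
    return 1.0 + min(per_call_bonus * r.tool_calls, max_bonus)


def oversample_filter_rank(candidates: List[Rollout], keep: int) -> List[Rollout]:
    """Select `keep` rollouts from a pool that was sampled larger than needed."""
    # Filtering: discard rollouts that are unusable for training.
    valid = [r for r in candidates if not r.broken]
    # Ranking (assumed criterion): prefer the most interactive rollouts.
    ranked = sorted(valid, key=lambda r: r.tool_calls, reverse=True)
    return ranked[:keep]


# Oversample five rollouts, keep two for the training batch.
pool = [Rollout(True, 0), Rollout(True, 3), Rollout(False, 5),
        Rollout(True, 1, broken=True), Rollout(True, 8)]
batch = oversample_filter_rank(pool, keep=2)
rewards = [accumulative_tool_reward(r) for r in batch]
```

Under these assumptions, a rollout that answers correctly without any tool call still earns the base reward, but it is outranked during selection and earns less than an equally correct rollout that used tools, which is the pressure against collapse that the takeaways describe.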

📖 Full Retelling

A team of seven researchers led by Shitian Zhao posted PyVision-RL, a reinforcement learning framework for open-weight multimodal models, to the arXiv preprint server on February 24, 2026. The work targets interaction collapse, a failure mode in which a model trained with reinforcement learning learns to reduce tool usage and multi-turn reasoning, which limits the benefits of agentic behavior.

To stabilize training and sustain interaction, PyVision-RL combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward that encourages multi-turn tool use. Using a unified training pipeline, the team developed two models: PyVision-Image for image understanding and PyVision-Video for video understanding.

For video reasoning, PyVision-Video uses on-demand context construction: it selectively samples task-relevant frames during reasoning, which significantly reduces visual token usage. The authors report strong performance and improved efficiency, and argue that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
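To make the idea of on-demand context construction concrete, here is a minimal sketch of a context that grows only when the agent asks for frames from a specific window. The class name `OnDemandContext`, the `request` interface, and the cost of 256 visual tokens per frame are hypothetical; the source does not describe how PyVision-Video exposes frame sampling.

```python
from typing import List

TOKENS_PER_FRAME = 256  # assumed visual-token cost of one frame


def sample_frames(start: int, end: int, n: int) -> List[int]:
    """Return up to n evenly spaced frame indices in the range [start, end)."""
    if n <= 0 or end <= start:
        return []
    n = min(n, end - start)
    step = (end - start) / n
    return [start + int(i * step) for i in range(n)]


class OnDemandContext:
    """Holds only the frames the agent has explicitly requested so far."""

    def __init__(self) -> None:
        self.frames: List[int] = []

    def request(self, start: int, end: int, n: int) -> List[int]:
        """Add n frames from a window, skipping frames already in context."""
        new = [f for f in sample_frames(start, end, n) if f not in self.frames]
        self.frames.extend(new)
        return new

    @property
    def visual_tokens(self) -> int:
        return len(self.frames) * TOKENS_PER_FRAME


# A 3000-frame video: coarse overview first, then a closer look at one window.
ctx = OnDemandContext()
overview = ctx.request(0, 3000, 8)
detail = ctx.request(1200, 1300, 8)

# Baseline for comparison: 64 uniformly sampled frames loaded up front.
dense_tokens = len(sample_frames(0, 3000, 64)) * TOKENS_PER_FRAME
```

In this toy setup the on-demand context holds 16 frames, a quarter of the tokens used by the 64-frame baseline, while covering the window of interest more densely. The real savings depend on the model and task and are not given in the source.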

🏷️ Themes

Artificial Intelligence, Reinforcement Learning, Computer Vision, Multimodal Models

📚 Related People & Topics

Reinforcement learning


Field of machine learning

In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learnin...


Multimodal learning

Machine learning methods using multiple input modalities

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question...


Computer vision

Computerized information extraction from images

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. "Understanding" in this context signifies th...



Original Source
arXiv:2602.20739 [Submitted on 24 Feb 2026]

Title: PyVision-RL: Forging Open Agentic Vision Models via RL

Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei

Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Comments: preprint
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2602.20739 [cs.AI] (or arXiv:2602.20739v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20739 (arXiv-issued DOI via DataCite, pending registration)

Submission history
From: Shitian Zhao
[v1] Tue, 24 Feb 2026 10:08:33 UTC (26,520 KB)

Source

arxiv.org
