
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

#Reinforcement Learning #Multi-Modal Language Models #RLVR #CalibRL #Machine Learning Research #AI Exploration #ICLR 2026 #arXiv

πŸ“Œ Key Takeaways

  • CalibRL framework addresses exploration challenges in RLVR training for multi-modal language models
  • Distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution
  • Asymmetric activation function uses expert knowledge as a calibration baseline
  • Demonstrated consistent improvements across eight benchmarks in both in-domain and out-of-domain settings
  • Code for the implementation is publicly available

πŸ“– Full Retelling

In a paper submitted to arXiv on February 22, 2026, researchers Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, and Jungong Han introduced CalibRL, a hybrid-policy RLVR framework designed to address exploration challenges in multi-modal large language models (MLLMs). The work tackles a critical failure mode of reinforcement learning with verifiable rewards (RLVR) training: the enormous state space of MLLMs combined with sparse rewards often leads to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors.

CalibRL implements controllable exploration with expert guidance through two key mechanisms that maintain productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. The first, distribution-aware advantage weighting, scales updates by group rareness to calibrate the distribution and preserve exploration. The second, an asymmetric activation function, leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction (a hedged sketch of the weighting idea follows this retelling). Together, these mechanisms increase policy entropy in a guided manner and clarify the target distribution by estimating the on-policy distribution through online sampling, preventing convergence to erroneous patterns. They also help alleviate the distributional mismatch between the model's policy and the expert trajectories.

The researchers validated the framework through extensive experiments across eight benchmarks, spanning both in-domain and out-of-domain settings, and report consistent improvements along with a more stable balance between exploration and exploitation.
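To make the first mechanism more concrete, here is a minimal sketch of what distribution-aware advantage weighting could look like in a GRPO-style RLVR step. Everything in it is an assumption for illustration: the function names, the choice of mean per-response log-probability as the "group rareness" signal, and the softmax weighting are not taken from the paper, whose exact formulation is not reproduced in the excerpt below.

```python
# Hypothetical sketch: distribution-aware advantage weighting.
# Assumptions (not from the paper): rareness is proxied by how unlikely
# a group's sampled responses are under the current policy, and rarer
# groups receive proportionally larger advantage magnitudes.
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each group's verifiable
    rewards (e.g., 0/1 verifier scores) by the group's mean and std.

    rewards: (num_groups, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def rareness_weights(group_logprobs: torch.Tensor) -> torch.Tensor:
    """One possible rareness proxy: groups whose responses have low
    average log-probability under the current policy are 'rare' and
    get up-weighted, preserving exploration of low-probability
    behaviors.

    group_logprobs: (num_groups,) mean per-token log-prob per group.
    """
    weights = torch.softmax(-group_logprobs, dim=0)  # rarer -> larger
    return weights * weights.numel()  # rescale so the mean weight is 1

def calibrated_advantages(rewards: torch.Tensor,
                          group_logprobs: torch.Tensor) -> torch.Tensor:
    """Scale group-relative advantages by group rareness."""
    adv = group_advantages(rewards)        # (num_groups, group_size)
    w = rareness_weights(group_logprobs)   # (num_groups,)
    return adv * w.unsqueeze(1)
```

Rescaling the weights to average 1 is a deliberate choice in this sketch: it reshapes which groups drive the update without changing the overall learning-rate scale, which is one plausible reading of "calibrating the distribution" while preserving exploration.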

🏷️ Themes

Machine Learning, Reinforcement Learning, Multi-Modal Reasoning, AI Research

πŸ“š Related People & Topics

Reinforcement learning

Field of machine learning

In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning...


Entity Intersection Graph

Connections for Reinforcement learning:

🌐 Large language model 8 shared
🌐 Artificial intelligence 6 shared
🌐 Machine learning 4 shared
🌐 Reasoning model 2 shared
🌐 Educational technology 2 shared
Original Source

Computer Science > Machine Learning
arXiv:2602.20197 [Submitted on 22 Feb 2026]

Title: Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Authors: Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the drawbacks of uncontrolled random sampling, which yields inefficient exploration. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, an asymmetric activation function leverages expert knowledge as a calibration baseline to moderate overconfident updates while preserving their corrective direction. CalibRL increases policy entropy in a guided manner and clarifies the target distribution by estimating the on-policy distribution through online sampling. Updates are driven by these informative behaviors, avoiding convergence to erroneous patterns. Importantly, these designs help alleviate the distributional mismatch between the model's policy and expert trajectories, thereby achieving a more stable balance between exploration and exploitation. Extensive experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvement...
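The abstract's second mechanism, the asymmetric activation function, can be illustrated with a similarly hedged sketch. The exponential gate below is an illustrative stand-in, not the paper's published form; it only tries to capture the properties the abstract states: an expert baseline calibrates confidence, overconfident updates are moderated, and the corrective direction (the sign of the advantage) is preserved.

```python
# Hypothetical sketch: asymmetric activation with an expert baseline.
# The gating function is an assumption chosen to match the abstract's
# description, not the paper's actual definition.
import torch

def asymmetric_calibrate(policy_logprob: torch.Tensor,
                         expert_logprob: torch.Tensor,
                         advantage: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """Damp update magnitude where the policy is more confident than
    the expert baseline; leave under-confident updates untouched.
    The sign of `advantage` is never flipped."""
    # Asymmetry: only the overconfident side of the gap is penalized.
    gap = (policy_logprob - expert_logprob).clamp(min=0.0)
    # Gate in (0, 1]: equals 1 when the policy is no more confident
    # than the expert, decays smoothly as overconfidence grows.
    gate = torch.exp(-gap / temperature)
    return advantage * gate  # moderated magnitude, same direction
```

Because the gate lies in (0, 1] and multiplies the advantage, the update direction is preserved while its magnitude shrinks as the policy drifts above the expert baseline, which is one way the distributional mismatch between policy and expert trajectories could be kept in check.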
Read full article at source

Source

arxiv.org
