F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
#F-GRPO #Reinforcement Learning #RLVR #arXiv #Large Language Models #Group Relative Policy Optimization #Algorithm Stability
📌 Key Takeaways
- Researchers introduced F-GRPO to fix a bias in Reinforcement Learning that favors common but potentially inferior solutions.
- The study shows that the small sampling groups feasible in RLVR often fail to contain 'rare-correct' trajectories, so policy updates concentrate probability on common solutions.
- Because large group sizes are ruled out by computational limits, standard GRPO-style advantage estimation is biased toward trajectories that are already likely; F-GRPO counters this with an improved advantage estimate (see the sketch after this list).
- The proposed method helps Large Language Models retain complex reasoning capabilities during the fine-tuning phase.
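As a rough illustration of the group-relative advantage estimate that GRPO-style methods build on (each trajectory's reward minus the group mean, scaled by the group standard deviation), the sketch below uses hypothetical rewards; the function name and values are illustrative and not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Group-relative advantage: reward minus the group mean, scaled by the
    group standard deviation, so only within-group variation drives the update."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical verifiable rewards for a small group of 4 sampled trajectories.
# If the rare-correct trajectory is never sampled, the group still contains
# mixed rewards, and every positive advantage goes to a common solution.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

With a small group that happens to contain no rare-correct trajectory, all the positive advantages reinforce common solutions, which is the bias the paper targets.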
📖 Full Retelling
🏷️ Themes
Artificial Intelligence, Machine Learning, Research
📚 Related People & Topics
Policy gradient method
Class of reinforcement learning algorithms
Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function and derive a policy from it, policy optimization methods learn the policy function directly.
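As a rough, self-contained illustration of "directly learning a policy function", the toy REINFORCE-style loop below nudges a softmax policy's logits along grad log pi(a) times the reward; the two-action setup, rewards, and learning rate are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a toy policy over 2 actions for a single state

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# One REINFORCE-style step per iteration: sample an action, observe a reward,
# and move the logits in the direction of grad log pi(a) * reward.
for _ in range(200):
    probs = policy(theta)
    a = rng.choice(2, p=probs)
    reward = 1.0 if a == 1 else 0.2       # hypothetical reward signal
    grad_log_pi = np.eye(2)[a] - probs    # gradient of log softmax w.r.t. logits
    theta += 0.1 * reward * grad_log_pi   # ascend the policy gradient

print(policy(theta))  # probability mass shifts toward the higher-reward action
```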
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs).
Reinforcement learning
Field of machine learning
In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
📄 Original Source Content
arXiv:2602.06717v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability
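Under a simple i.i.d. sampling assumption (a back-of-the-envelope sketch, not the paper's derivation), the effect of group size on missing rare-correct trajectories can be illustrated: if each sampled trajectory is rare-correct with probability p, a group of size G contains none with probability (1 - p)^G.

```python
# Illustrative only: the rate p = 0.05 and the group sizes are assumptions,
# not values from the paper.
def prob_group_misses_rare(p: float, group_size: int) -> float:
    """Probability that none of `group_size` i.i.d. samples is rare-correct."""
    return (1.0 - p) ** group_size

for g in (4, 8, 16, 64):
    print(f"G={g:>2}: miss probability = {prob_group_misses_rare(0.05, g):.3f}")
# At G=4 a 5%-likely correct trajectory is missed about 81% of the time, so the
# group's mixed rewards keep concentrating probability on common solutions.
```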