Точка Синхронізації

AI Archive of Human History

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

#F-GRPO #Reinforcement Learning #RLVR #arXiv #Large Language Models #Group Relative Policy Optimization #Algorithm Stability

📌 Key Takeaways

  • Researchers introduced F-GRPO to fix a bias in Reinforcement Learning that favors common but potentially inferior solutions.
  • The study highlights how small sampling groups in RLVR often miss 'rare-correct' trajectories, degrading model performance (see the probability sketch after this list).
  • Hardware limits force standard GRPO to run with small groups; F-GRPO compensates through improved advantage estimation rather than larger sample sizes.
  • The proposed method helps Large Language Models retain complex reasoning capabilities during the fine-tuning phase.
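Why small groups are the crux: if a correct but rare solution is sampled with low per-response probability, the chance that it shows up in a small group at all is tiny, so most updates never see it. A back-of-envelope sketch of that effect (my own illustration with an assumed 1% rarity figure, not the paper's derivation):

```python
# Back-of-envelope illustration (assumed numbers, not from the paper):
# chance that a rare-correct trajectory appears at least once in a sampled group.
def prob_rare_in_group(p_rare: float, group_size: int) -> float:
    """P(at least one rare-correct sample) = 1 - (1 - p_rare)^G, assuming i.i.d. sampling."""
    return 1.0 - (1.0 - p_rare) ** group_size

for g in (4, 8, 16, 64, 256):
    print(f"group size {g:3d}: {prob_rare_in_group(0.01, g):.3f}")
# With a 1% per-sample probability, a group of 8 contains the rare solution only
# ~8% of the time, so most gradient updates reinforce whichever common solutions
# happen to dominate that group.
```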

📖 Full Retelling

A team of AI researchers published a technical paper on the arXiv preprint server introducing Filtered Group Relative Policy Optimization (F-GRPO) to address critical sampling failures in Reinforcement Learning with Verifiable Rewards (RLVR). The researchers identified a systemic flaw in which standard group-sampling methods cause Large Language Models to over-prioritize common, obvious solutions while neglecting rarer but correct reasoning paths. By refining how rewards are converted into advantages within small sample groups, the authors aim to prevent models from 'forgetting' complex or infrequent logic during fine-tuning.

The core of the problem lies in the computational limitations of Group Relative Policy Optimization (GRPO), a popular technique used in training models such as DeepSeek-R1. GRPO estimates each response's advantage relative to the other responses sampled in the same group, so when group sizes are restricted by hardware constraints, that estimate is skewed toward whatever the group happens to contain. If the sampled responses consist mostly of common solutions and rare, high-quality alternatives are absent or outnumbered, the training signal that would reinforce those rare successes is weak or missing entirely. Consequently, the model converges on mediocre, frequent patterns and gradually loses the ability to generate sophisticated, out-of-distribution reasoning.

To bridge this gap, the F-GRPO framework introduces a filtered approach to group sampling that keeps rare-correct trajectories from being overshadowed by the statistical noise of mixed-quality responses, and the researchers derive new probability calculations for estimating advantages under hardware-constrained group sizes. This development matters for the broader AI field because it suggests that smaller, more efficient training runs can reach state-of-the-art reasoning performance if the underlying policy optimization algorithm is mathematically tuned to value quality over sheer frequency.
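To make the mechanism concrete, here is a minimal sketch of group-relative advantage estimation in the GRPO style, together with one hypothetical way a 'filtered' reweighting could protect a rare-correct sample. The filtering rule, function names, and toy numbers are illustrative assumptions based on the abstract, not the authors' algorithm or released code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: standardize verifiable rewards within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def filtered_advantages(rewards: np.ndarray, counts: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical filter: split each solution cluster's advantage across its duplicates,
    so many copies of one common answer do not drown out a single rare-correct trajectory."""
    return grpo_advantages(rewards, eps) / np.maximum(counts, 1.0)

# Toy group of 8 responses: six copies of one common correct solution,
# one common failure, one rare but correct alternative solution.
rewards = np.array([1, 1, 1, 1, 1, 1, 0, 1], dtype=float)
counts  = np.array([6, 6, 6, 6, 6, 6, 1, 1], dtype=float)  # duplicates of each solution

print("plain GRPO :", np.round(grpo_advantages(rewards), 2))
print("filtered   :", np.round(filtered_advantages(rewards, counts), 2))
# Plain GRPO gives every correct copy the same +0.38, so the six duplicates of the
# common solution soak up most of the positive gradient mass; the filtered variant
# leaves the rare-correct sample with the largest per-sample positive weight.
```

Whatever concrete rule the paper actually uses, the point the sketch illustrates is the same: the per-sample weight on a rare-correct trajectory must survive the presence of many duplicated common answers.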

🏷️ Themes

Artificial Intelligence, Machine Learning, Research

📚 Related People & Topics

  • Policy gradient method — class of reinforcement learning algorithms that directly learn a policy function rather than deriving one from a value function (Wikipedia)
  • Large language model — type of machine learning model trained with self-supervised learning on vast amounts of text for natural language processing tasks (Wikipedia)
  • Reinforcement learning — field of machine learning concerned with how an agent should act in a dynamic environment to maximize a reward signal (Wikipedia)

📄 Original Source Content
arXiv:2602.06717v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability

Original source
