Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

#Batch Adaptation Policy Optimization (BAPO) #Off-policy reinforcement learning #Large language models #Reinforcement Learning with Verifiable Rewards (RLVR) #Data efficiency #Machine learning reasoning #Post-training optimization

📌 Key Takeaways

  • Researchers developed BAPO, an off-policy RLVR framework to improve data efficiency in LLM post-training
  • BAPO dynamically selects training batches by re-evaluating difficult samples and reusing high-quality ones
  • The framework achieves 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks
  • BAPO resolves 40.7% of problems that base models consistently fail to solve (see the sketch after this list for why such samples stall on-policy training)
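Why do problems that base models consistently fail on stall on-policy training? In GRPO-style RLVR, each prompt's rollouts are scored by a verifiable reward and advantages are normalized within the rollout group; if every rollout of a hard prompt fails, the group's rewards are homogeneous and all advantages collapse to zero, so the prompt contributes no gradient. The snippet below is a minimal illustration of that effect (illustrative Python, not code from the paper):

```python
# Minimal sketch: group-relative advantages as in GRPO-style RLVR.
# When every rollout of a hard prompt fails, rewards are homogeneous,
# advantages are all zero, and the sample yields no learning signal --
# the "experience waste" that BAPO's buffer is designed to avoid.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros -> no gradient signal
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed group -> useful signal
```

BAPO's batch adaptation addresses these zero-signal groups by re-evaluating historically difficult samples later in training, when the policy may have improved enough to produce mixed rewards.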

📖 Full Retelling

Researchers Xu Wan, Yansheng Wang, Wenqi Huang, and Mingyang Sun introduced Batch Adaptation Policy Optimization (BAPO), an off-policy Reinforcement Learning with Verifiable Rewards (RLVR) framework, on arXiv on February 24, 2026. BAPO addresses the experience waste and reward homogeneity that hinder learning efficiency during large language model post-training.

Traditional on-policy RLVR frameworks handle difficult samples poorly: rollouts on prompts the model cannot yet solve are discarded after each update, and their uniformly failing rewards carry no learning signal. BAPO counters this by dynamically selecting training batches, re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee for policy improvement. The approach departs from conventional on-policy methods by making data efficiency and targeted learning the organizing principles of batch construction.

Extensive experiments show that BAPO achieves an average 12.5% improvement over Group Relative Policy Optimization (GRPO) across mathematics, planning, and visual reasoning tasks. Most notably, it resolves 40.7% of problems that base models consistently fail to solve, indicating its potential to overcome significant limitations in current language model capabilities.
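Going only by the abstract, a buffer-driven loop along the following lines is one plausible reading of "dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones". Every class name, threshold, and heuristic below is an assumption for illustration, not the authors' implementation:

```python
# Hypothetical sketch of a BAPO-style buffer, based only on the abstract:
# requeue prompts whose rollout groups earned homogeneous rewards, keep
# high-quality rollouts for off-policy reuse, and mix retries into fresh
# batches. All names and thresholds are assumptions, not the paper's code.
import random
from dataclasses import dataclass, field

@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float  # verifiable reward, e.g. 1.0 if the final answer checks out

@dataclass
class AdaptiveBuffer:
    hard_prompts: list[str] = field(default_factory=list)       # uniformly failed groups
    good_rollouts: list[Rollout] = field(default_factory=list)  # reusable off-policy data

    def record(self, prompt: str, group: list[Rollout]) -> None:
        """File a finished rollout group as either 'retry later' or 'reusable'."""
        rewards = [r.reward for r in group]
        if max(rewards) == min(rewards):      # homogeneous reward -> zero advantage
            self.hard_prompts.append(prompt)  # re-evaluate under a later, stronger policy
        else:
            self.good_rollouts.extend(r for r in group if r.reward > 0)

    def build_batch(self, fresh: list[str], size: int, retry_frac: float = 0.25) -> list[str]:
        """Fill a training batch with fresh prompts plus some requeued hard ones."""
        n_retry = min(int(size * retry_frac), len(self.hard_prompts))
        retries = [self.hard_prompts.pop(random.randrange(len(self.hard_prompts)))
                   for _ in range(n_retry)]
        return retries + fresh[: size - n_retry]

    def reuse(self, k: int) -> list[Rollout]:
        """Sample up to k stored high-quality rollouts for off-policy updates."""
        return random.sample(self.good_rollouts, min(k, len(self.good_rollouts)))
```

The paper's actual selection rule, and the lower-bound guarantee it provides for policy improvement, are not reproduced here; the sketch only conveys the core intuition that homogeneous-reward groups are requeued rather than discarded, while high-quality off-policy rollouts are kept for reuse.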

🏷️ Themes

Artificial Intelligence, Machine Learning, Reinforcement Learning, Language Models

📚 Related People & Topics

Large language model

Type of machine learning model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...


Data efficiency

Data efficiency refers to efficiency of the many processes that can be applied to data such as storage, access, filtering, sharing, etc., and whether or not the processes lead to the desired outcome within resource constraints. A management definition of data efficiency would be the measure of how d...


Original Source

Computer Science > Artificial Intelligence — arXiv:2602.20722 [cs.AI]
Submitted on 24 Feb 2026 (v1) by Xu Wan

Title: Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
Authors: Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun

Abstract: Traditional on-policy Reinforcement Learning with Verifiable Rewards frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve data efficiency in large language model post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower-bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.

DOI: https://doi.org/10.48550/arXiv.2602.20722 (arXiv-issued via DataCite, pending registration)

Read full article at source

Source

arxiv.org
