Researchers developed a two-stage framework to address overthinking in large reasoning models (LRMs).
The approach combines Hybrid Fine-Tuning with adaptive reinforcement learning built on Correctness-Preserving Advantage Shaping and Length-Aware Gradient Regulation.
Experiments on Qwen2.5-1.5B and 7B show accuracy gains of up to +3.7/+3.6 points while cutting generated tokens by 40.6%/43.9%.
The method remains robust across varying problem difficulties and out-of-distribution tasks.
📖 Full Retelling
On February 26, 2026, Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, and Lijun Li published a paper on arXiv introducing a two-stage framework for stable adaptive thinking in large reasoning models (LRMs). The work targets a persistent problem: LRMs achieve strong results through extended reasoning traces, but they tend to overthink low-complexity queries, and existing mitigations suffer from unstable accuracy-efficiency trade-offs and poor robustness to diverse reasoning behaviors. The paper, titled 'Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation,' proposes a framework that first applies Hybrid Fine-Tuning, exposing the model to both thinking and no-thinking behaviors to establish a well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping, which avoids suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation, which stabilizes optimization when sampled reasoning lengths vary widely.
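The paper's exact formulas are not reproduced in this summary, but the intuition behind Correctness-Preserving Advantage Shaping can be sketched in a group-relative (GRPO-style) RL setting: among sampled responses to a query, shorter correct answers receive a mild bonus via a length penalty, while a floor ensures a correct long-chain answer is never assigned a negative advantage. Every detail below (the function name, the normalization, the `beta` coefficient) is an illustrative assumption, not the authors' formulation.

```python
import numpy as np

def shaped_advantages(rewards, lengths, beta=0.1):
    """Hypothetical sketch of correctness-preserving advantage shaping.

    Within a group of sampled responses, shorter correct answers are
    favored, but correct long-chain answers are never pushed below zero
    advantage, so correct reasoning is not suppressed.
    """
    rewards = np.asarray(rewards, dtype=float)   # 1 = correct, 0 = incorrect
    lengths = np.asarray(lengths, dtype=float)   # generated tokens per response
    # Standard group-relative advantage (GRPO-style baseline).
    adv = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        adv = adv / std
    # Length penalty applied only to correct responses ...
    norm_len = (lengths - lengths.min()) / max(np.ptp(lengths), 1.0)
    penalty = beta * norm_len * (rewards > 0)
    shaped = adv - penalty
    # ... but never flip a correct response's advantage negative.
    return np.where(rewards > 0, np.maximum(shaped, 0.0), shaped)
```

Under this sketch, a group with two correct answers of 100 and 400 tokens gives the shorter one the larger (and still non-negative) advantage, while incorrect answers keep their unshaped negative advantage.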
A reasoning model, also known as a reasoning language model (RLM) or large reasoning model (LRM), is a type of large language model (LLM) that has been specifically trained to solve complex tasks requiring multiple steps of logical reasoning.
In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
Computer Science > Machine Learning
arXiv:2602.22556 [Submitted on 26 Feb 2026]
Title: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li
Abstract: Large reasoning models achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
Comments: 15 pages, 7 figures. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL). Cite as: arXiv:2602.22556 [cs.LG] (arXiv:2602.22556v1 for this version). DOI: https://doi.org/10.48550/arXiv.2602.22556 (arXiv-issued DOI via DataCite, pending registration). Submission history: [v1] Thu, 26 Feb 2026, from Ziqi Miao.
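The abstract's "severe reasoning-length heterogeneity" points at a well-known aggregation issue: if per-token losses are averaged over all tokens in a batch, a 2000-token reasoning trace contributes roughly 40x the gradient of a 50-token direct answer. One plausible reading of Length-Aware Gradient Regulation is to rebalance this, illustrated below by contrasting naive token-proportional weighting with equal per-sequence weighting. The function and weighting scheme are hypothetical illustrations, not the paper's exact rule.

```python
import numpy as np

def batch_gradient_weight(lengths, mode="length_aware"):
    """Per-sequence contribution to the batch gradient under two schemes.

    'token_sum'    : weight proportional to token count, so long reasoning
                     traces dominate the update (the instability source).
    'length_aware' : one equal vote per sequence, regardless of length.
    Hypothetical sketch; the paper's regulation rule may differ.
    """
    lengths = np.asarray(lengths, dtype=float)
    if mode == "token_sum":
        return lengths / lengths.sum()
    return np.full(len(lengths), 1.0 / len(lengths))
```

For a batch with a 50-token answer and a 2000-token trace, token-proportional weighting gives the long trace about 97.6% of the gradient, while the length-aware scheme gives each sequence 50%.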