Researchers identified a critical gap between strategy usage and strategy executability in mathematical reasoning
Human and AI strategies show systematic differences with complementary strengths
The proposed SSR framework selectively combines strategies based on executability
SSR improved accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models
Full Retelling
Researchers Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, and Kenji Kawaguchi published a paper on February 26, 2026, that addresses the instability of example-based guidance in mathematical reasoning: even correct, problem-relevant guidance helps inconsistently across problems and models. The paper, "Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance," traces this instability to a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model).

Through a controlled analysis of paired human-written and model-generated solutions, the researchers identified systematic, domain-dependent differences between human- and model-derived strategies, which exhibit complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, they propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR consistently outperformed direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models.
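The retelling above describes SSR only at a high level. As an illustrative sketch of the core idea (not the paper's implementation; the class, function names, and the simple success-rate scoring rule are all hypothetical), selecting strategies by empirical executability rather than mere usage might look like:

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    """A candidate reasoning strategy (hypothetical representation)."""
    text: str
    source: str        # "human" or "model"
    trials: int = 0    # times instantiated as guidance on similar problems
    successes: int = 0 # times that guidance led to a correct answer

    def executability(self) -> float:
        # Empirical executability: success rate when used as guidance,
        # not mere presence in successful solutions (strategy usage).
        return self.successes / self.trials if self.trials else 0.0

def select_strategies(candidates, k=2, min_trials=3):
    """Source-aware selection: keep the most executable strategy per
    source, skipping strategies with too little evidence."""
    best = {}
    for s in candidates:
        if s.trials < min_trials:
            continue  # not enough evidence to trust the estimate
        cur = best.get(s.source)
        if cur is None or s.executability() > cur.executability():
            best[s.source] = s
    # Combine up to k strategies across sources, strongest first.
    return sorted(best.values(), key=lambda s: -s.executability())[:k]

# Usage: a well-tested human strategy beats a weaker model-derived one,
# and a strategy with a single trial is excluded for lack of evidence.
human = Strategy("Use modular arithmetic on the exponent.", "human", 10, 8)
model = Strategy("Enumerate small cases first.", "model", 10, 6)
weak = Strategy("Exploit symmetry.", "model", 1, 1)
picked = select_strategies([human, model, weak])
# picked[0] is the human-sourced strategy (success rate 0.8 vs 0.6)
```

The design point the sketch is meant to capture is the paper's diagnosis: the selection signal is empirical and source-aware (measured per guidance source), rather than assuming a strategy that appears in successful solutions will also work as guidance.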
Logical reasoning is a mental activity that aims to arrive at a conclusion in a rigorous way. It proceeds through inferences or arguments: starting from a set of premises and reasoning to a conclusion supported by them. Both the premises and the conclusion are propositions, i.e., statements that can be true or false.
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs).
Computer Science > Artificial Intelligence
arXiv:2602.22583 [Submitted on 26 Feb 2026]
Title: Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Authors: Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi
Abstract: Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: this https URL.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2602.22583 [cs.AI]