UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
#UpSkill #MutualInformation #LargeLanguageModels #ReinforcementLearning #ResponseDiversity #PassAtK #GSM8K #OpenWeightModels
📌 Key Takeaways
Researchers developed UpSkill, a training method for LLMs that enhances response diversity while maintaining accuracy
Standard RL approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, overlooking alternative strategies
Experiments on GSM8K with three open-weight models showed mean gains of ~3% in pass@k for the two stronger models (Qwen 2.5-7B and Llama 3.1-8B) without degrading pass@1
Improvements in pass@k are closely tied to the mutual information objective
📖 Full Retelling
In a paper submitted to arXiv on February 25, 2026, researchers Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, and Benjamin Eysenbach introduced UpSkill, a training method for large language models that enhances response diversity while maintaining accuracy. The work addresses a limitation of standard reinforcement learning approaches: by optimizing single-attempt accuracy, they can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking alternative problem-solving strategies. UpSkill adapts Mutual Information Skill Learning to LLMs and implements a token-level mutual information reward within Group Relative Policy Optimization (GRPO).

UpSkill directly targets pass@k correctness, the probability of obtaining at least one correct response among k sampled attempts, in contrast to traditional approaches that optimize only pass@1 (single-attempt accuracy). The researchers evaluated the method on the GSM8K dataset with three open-weight models: Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B. UpSkill improved multi-attempt metrics on the stronger base models, with mean gains of roughly 3% in pass@k for both Qwen and Llama without degrading pass@1.

The study also provides empirical and theoretical evidence that the pass@k improvements are closely tied to the mutual information objective, suggesting that rewarding trajectory specificity at the token level can broaden the diversity of problem-solving approaches while preserving accuracy.
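In practice, pass@k is usually computed with the standard unbiased estimator from the code- and math-evaluation literature: sample n ≥ k attempts per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct attempt. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts with c correct:
    the probability that at least one of k attempts drawn without
    replacement is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:  # fewer than k incorrect attempts: every k-subset hits a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 attempts of which c = 5 are correct, pass@1 estimates to 0.5 while pass@5 is close to 1, which is exactly the gap between single-attempt and multi-attempt performance that a diversity-encouraging reward aims to exploit.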
🏷️ Themes
Machine Learning, Artificial Intelligence, Language Models
In machine learning and optimal control, reinforcement learning (RL) is concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs).
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats or hartleys) obtained about one random variable by observing the other random variable.
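For discrete random variables, this definition reduces to a sum over the joint distribution, I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x)p(y))]. A small self-contained sketch (illustrative only, not part of the paper):

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, given the joint distribution as a dict {(x, y): p}.
    Marginals p(x) and p(y) are recovered by summing the joint."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

Two independent fair coins give 0 bits of mutual information, while two perfectly correlated fair coins give exactly 1 bit.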
Computer Science > Machine Learning
arXiv:2602.22296 [Submitted on 25 Feb 2026]
Title: UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach
Abstract: Reinforcement Learning with Verifiable Rewards has improved the reasoning abilities of large language models on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization: a token-level mutual information reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.
Comments: First two authors contributed equally. 29 pages total (11 pages main text), 10 figures, 10 tables.
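To make the abstract's "token-level mutual information reward that encourages trajectory specificity to z" more concrete: in mutual information skill learning (DIAYN-style variational objectives), a latent skill z is sampled, the policy generates conditioned on z, and a learned discriminator q(z | trajectory) scores how identifiable z is from the output; a common variational lower-bound reward is log q(z | ·) − log p(z). The sketch below is a hypothetical illustration of that generic recipe, not the paper's actual reward or its GRPO integration; `logq_z_given_prefix` and `num_skills` are assumed names.

```python
from math import log

def skill_reward(logq_z_given_prefix, num_skills):
    """Illustrative per-token MI reward: r_t = log q(z | tokens up to t) - log p(z),
    with a uniform skill prior p(z) = 1 / num_skills.

    logq_z_given_prefix: the discriminator's log-probability of the sampled
    skill z after each generated token (an assumed interface, for sketch
    purposes only). The reward is positive whenever the discriminator
    identifies z better than chance, i.e. when the trajectory is specific
    to its skill."""
    log_pz = log(1.0 / num_skills)
    return [lq - log_pz for lq in logq_z_given_prefix]
```

Under this kind of objective, trajectories that are distinguishable by skill earn higher reward, which is one way to encourage structurally different attempts at the same problem.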
Project website: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.22296 [cs.LG] (or arXiv:2602.22296v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.22296
Submission history: [v1] Wed, 25 Feb 2026, from Owen Yang