Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
#Duel-Evolve #Large Language Models #Evolutionary Optimization #Pairwise Comparisons #Bayesian Bradley-Terry Model #Test-Time Scaling #Self-Preferences #MathBench
📌 Key Takeaways
- Duel-Evolve replaces external reward systems with self-generated pairwise preferences in LLMs
- The method eliminates the need for reward models, ground-truth labels, and hand-crafted scoring functions
- It achieved 20 percentage points higher accuracy on MathBench compared to existing methods
- It improved performance by over 12 percentage points on LiveCodeBench
📖 Full Retelling
Researchers led by Sweta Karlekar introduced Duel-Evolve, an evolutionary optimization algorithm for large language models that replaces external reward systems with self-generated pairwise preferences. The paper, submitted to arXiv on February 25, 2026, addresses the challenge of optimizing LLM outputs when traditional scoring methods are unavailable or unreliable. The research team also includes Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, and David Blei.

Existing test-time methods guide search with a calibrated scalar evaluator for the target objective, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are easier to elicit, still signal useful directions of improvement, and can be obtained from the LLM itself. Duel-Evolve therefore elicits preferences from the same model that generates the candidates, iteratively improving outputs without external supervision or hand-crafted reward systems.

Methodologically, Duel-Evolve aggregates the noisy candidate comparisons with a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These estimates serve two purposes: Double Thompson Sampling uses them to allocate the comparison budget toward plausible optima, and they guide the selection of high-quality parents that generate improved candidates.

Evaluated on MathBench and LiveCodeBench, the algorithm achieved 20 percentage points higher accuracy than existing methods on MathBench and improved over comparable iterative methods by more than 12 percentage points on LiveCodeBench, showing that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
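The Bradley-Terry aggregation step can be sketched in code. The paper specifies a Bayesian Bradley-Terry model but this summary does not say how inference is done, so the sketch below is an assumption: a MAP fit with a Gaussian prior plus a diagonal Laplace approximation to obtain the uncertainty-aware quality estimates. The function name and hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def bradley_terry_laplace(n_candidates, wins, n_iters=500, lr=0.05, prior_var=1.0):
    """Approximate-Bayesian Bradley-Terry fit (illustrative, not the paper's code).

    wins[i][j] = number of duels in which candidate i beat candidate j.
    Returns (mean, std): a MAP skill estimate under a N(0, prior_var) prior
    and a diagonal Laplace approximation of posterior uncertainty.
    """
    wins = np.asarray(wins, dtype=float)
    theta = np.zeros(n_candidates)
    for _ in range(n_iters):
        # Bradley-Terry win probability: P(i beats j) = sigmoid(theta_i - theta_j)
        diff = theta[:, None] - theta[None, :]
        p = 1.0 / (1.0 + np.exp(-diff))
        # Gradient of the log-likelihood plus the Gaussian prior term.
        grad = (wins - (wins + wins.T) * p).sum(axis=1) - theta / prior_var
        theta += lr * grad
    # Diagonal observed information (negative Hessian) gives the Laplace std.
    diff = theta[:, None] - theta[None, :]
    p = 1.0 / (1.0 + np.exp(-diff))
    info = ((wins + wins.T) * p * (1.0 - p)).sum(axis=1) + 1.0 / prior_var
    return theta, 1.0 / np.sqrt(info)
```

For example, a candidate that wins most of its duels ends up with a higher skill mean, while candidates with few duels keep a wider posterior, which is what lets the search keep exploring them.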
🏷️ Themes
Machine Learning Optimization, Large Language Models, Reward-Free AI Systems
Original Source
Computer Science > Machine Learning
arXiv:2602.21585 [Submitted on 25 Feb 2026]
Title: Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
Abstract: Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function.
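The Double Thompson Sampling step that allocates the comparison budget can be sketched as follows. This is a simplified illustration under assumed Gaussian skill posteriors (the helper name and the reduction of D-TS to two independent posterior draws are my simplifications, not the paper's implementation): each draw is optimistic in a different way, so near-optimal but uncertain candidates keep getting dueled.

```python
import numpy as np

def double_thompson_duel(mean, std, rng=None):
    """Pick the next duel (i, j) via simplified Double Thompson Sampling.

    mean, std: per-candidate skill posterior parameters (e.g. from a
    Bradley-Terry fit). Two independent posterior samples are drawn; the
    first arm maximizes the first draw, and the second arm maximizes the
    second draw over the remaining candidates, guaranteeing i != j.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    first = int(np.argmax(rng.normal(mean, std)))   # optimistic pick #1
    sample2 = rng.normal(mean, std)
    sample2[first] = -np.inf                        # force a distinct opponent
    second = int(np.argmax(sample2))                # optimistic pick #2
    return first, second
```

In an evolutionary loop like the one the abstract describes, the returned pair would be shown to the LLM for a preference judgment, the win matrix updated, and the posterior refit before the next duel.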
Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.