
On Randomness in Agentic Evals

#Agentic systems #SWE-Bench-Verified #pass@1 #AI benchmarks #Statistical variance #arXiv #Large Language Models

πŸ“Œ Key Takeaways

  • Standard 'pass@1' metrics based on single runs are insufficient for measuring AI agent performance.
  • Across 60,000 collected trajectories, single-run pass@1 estimates varied by 2.2 to 6.0 percentage points.
  • Inherent randomness in agentic systems can lead to misleading benchmark rankings.
  • The researchers used SWE-Bench-Verified to test three different models and two scaffolds.

πŸ“– Full Retelling

Researchers specializing in artificial intelligence evaluation published a technical paper on the arXiv preprint server this week challenging the reliability of current agentic system benchmarks. The team collected and analyzed 60,000 agentic trajectories on SWE-Bench-Verified, spanning three major large language models and two distinct scaffolds, and found significant statistical volatility in the resulting performance metrics. The investigation was prompted by concerns that the industry standard of reporting pass@1 scores based on a single trial per task fails to account for the inherent randomness of autonomous agent behavior, potentially leading to inaccurate leaderboards and overstated progress in the field.

The core of the research focused on the variance observed when an agent attempts the same task multiple times. Conventionally, AI developers report a single-run success rate, assuming it serves as a stable proxy for the model's actual capability. However, the researchers found that these single-run estimates can fluctuate by 2.2 to 6.0 percentage points solely due to chance. That margin is large enough to rearrange the rankings of top-tier models, suggesting that many current performance claims in AI research lack the statistical rigor needed for objective comparison.

To ensure a comprehensive analysis, the team used SWE-Bench-Verified, a benchmark designed to test an AI system's ability to resolve real-world software engineering issues. By scaling their tests to tens of thousands of trajectories, the authors demonstrated that what often appears to be a 'superior' model or architecture might simply be the beneficiary of a lucky run. These findings advocate for a shift in how the AI community evaluates agentic systems: multiple trials and confidence intervals should become the standard for reporting results, ensuring scientific validity and reproducibility.
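
To make the statistical point concrete, here is a minimal Python sketch (not the authors' code) of why a single run is a noisy estimator. The per-task success probabilities, the benchmark size, and the normal-approximation confidence interval are all illustrative assumptions, not values from the paper.

```python
import random
import statistics

# Minimal sketch: each task has a hidden per-task success probability
# (invented here). We compare repeated single-run pass@1 scores with a
# multi-run estimate plus a 95% confidence interval over tasks.
random.seed(0)
NUM_TASKS = 500  # hypothetical benchmark size, not SWE-Bench-Verified's
task_success_prob = [random.uniform(0.2, 0.9) for _ in range(NUM_TASKS)]

def single_run_pass_at_1():
    """One trial per task, as in conventional pass@1 reporting."""
    solved = sum(random.random() < p for p in task_success_prob)
    return solved / NUM_TASKS

def multi_run_estimate(k=10):
    """Average pass rate over k independent trials per task."""
    per_task = [
        sum(random.random() < p for _ in range(k)) / k
        for p in task_success_prob
    ]
    mean = statistics.mean(per_task)
    # 95% CI over tasks via the normal approximation (a common choice;
    # the paper may report uncertainty differently).
    stderr = statistics.stdev(per_task) / NUM_TASKS ** 0.5
    return mean, (mean - 1.96 * stderr, mean + 1.96 * stderr)

# Repeated single-run scores show the chance-driven spread.
singles = [single_run_pass_at_1() for _ in range(20)]
print(f"single-run pass@1 spread: {min(singles):.3f} to {max(singles):.3f}")

mean, (lo, hi) = multi_run_estimate(k=10)
print(f"10-run estimate: {mean:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```

Even in this toy setup, the spread between the luckiest and unluckiest single runs can span several percentage points, while the multi-run estimate with an interval makes the uncertainty explicit instead of hiding it.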

🏷️ Themes

Artificial Intelligence, Data Science, Statistics

Original Source
arXiv:2602.07150v1 Announce Type: cross Abstract: Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points

Source

arxiv.org
