BravenNow
AI benchmarks are broken. Here’s what we need instead.
| USA | technology | ✓ Verified - technologyreview.com


#AI benchmarks #human-AI interaction #context-specific evaluation #systemic risks #organizational workflows

📌 Key Takeaways

  • Current AI benchmarks compare AI to humans on isolated tasks, but real-world use involves complex, team-based environments.
  • Misalignment between testing and usage leads to misunderstandings of AI capabilities and systemic risks.
  • Researchers propose shifting to Human-AI, Context-Specific Evaluation (HAIC) benchmarks for longer-term, team-based assessments.
  • AI's performance emerges over extended use in organizational workflows, not in isolated task evaluations.

📖 Full Retelling

For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. This framing is seductive: an AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines.

But there’s a problem: AI is almost never used in the way it is benchmarked. Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That’s because they still evaluate AI’s performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds. While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use.

This misalignment leaves us misunderstanding AI’s capabilities, overlooking systemic risks, and misjudging its economic and social consequences. To mitigate this, it’s time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations.

I have studied real-world AI deployment since 2022 in small businesses and health, humanitarian, nonprofit, and higher-education organizations in the UK, the United States, and Asia, as well as within leading AI design ecosystems in London and Silicon Valley. I propose a different approach, which I call HAIC benchmarks: Human–AI, Context-Specific Evaluation.

What happens when AI fails

For governments and businesses, AI benchmark scores appear more objective than vendor

🏷️ Themes

AI Evaluation, Benchmark Reform

📚 Related People & Topics

Large language model

Type of machine learning model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...


Entity Intersection Graph

Connections for Large language model:

🌐 Artificial intelligence 3 shared
🌐 Reinforcement learning 3 shared
🌐 Educational technology 2 shared
🌐 Benchmark 2 shared
🏢 OpenAI 2 shared


Deep Analysis

Why It Matters

This news matters because current AI benchmarking methods create a dangerous misalignment between how AI is tested and how it is actually used in real-world settings. This affects businesses, governments, and organizations that rely on AI performance metrics to make critical decisions about adoption and investment. Flawed benchmarking leads to underestimation of systemic risks, misunderstanding of AI's true capabilities, and poor judgment about economic and social impacts. Ultimately, this affects everyone from policymakers to end users who interact with AI systems in healthcare, education, business operations, and other sectors.

Context & Background

  • Traditional AI benchmarking has focused on isolated tasks like chess, math problems, coding challenges, and writing tasks since the early days of AI research
  • The Turing Test, proposed in 1950, established the foundational concept of comparing machine intelligence to human intelligence
  • Recent years have seen the rise of standardized benchmarks like GLUE, SuperGLUE, and MMLU that measure AI performance on specific tasks
  • Major AI companies and research institutions have heavily relied on these benchmarks to demonstrate progress and claim superiority
  • There's growing recognition that static benchmarks don't capture how AI performs in collaborative, real-world environments where humans and machines interact
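The standardized benchmarks named above (GLUE, SuperGLUE, MMLU) all share the shape the article criticizes: a fixed question set scored by per-task correctness, with no humans or time dimension involved. As a rough illustration only (this is not any benchmark's actual harness; `toy_model` and the dataset are invented), a static, task-level evaluation can be reduced to a single accuracy number:

```python
# Illustrative sketch of a static, task-level benchmark: a model is scored
# once, in isolation, against a fixed set of (question, answer) pairs.
# Not any real benchmark's code; the model and data here are toy stand-ins.

def exact_match_accuracy(model, dataset):
    """Score a model by exact-match accuracy on fixed (question, answer) pairs."""
    correct = sum(1 for question, answer in dataset if model(question) == answer)
    return correct / len(dataset)

# A toy "model" standing in for an LLM; real systems are far messier.
toy_model = {"2+2": "4", "capital of France": "Paris"}.get

dataset = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
print(exact_match_accuracy(toy_model, dataset))  # 2 of 3 answers correct
```

Everything the article objects to is visible in the signature: the evaluation sees one model, one prompt at a time, and no team, workflow, or passage of time.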

What Happens Next

We can expect increased research into human-AI collaborative benchmarking methods like the proposed HAIC (Human–AI, Context-Specific Evaluation) framework. Organizations will likely begin piloting such evaluation approaches in the coming years, particularly in sectors like healthcare, education, and business operations where AI integration is already advanced. Regulatory bodies may start considering these more comprehensive evaluation methods for AI certification and safety standards within the next two to three years.

Frequently Asked Questions

What are HAIC benchmarks?

HAIC stands for Human–AI, Context-Specific Evaluation, a proposed benchmarking approach that assesses AI performance within actual human teams, workflows, and organizational contexts over extended periods rather than in isolated task environments.

Why are current AI benchmarks considered broken?

Current benchmarks evaluate AI in artificial, isolated conditions that don't reflect real-world usage where AI interacts with multiple humans in complex environments. This creates misleading performance assessments and overlooks systemic risks that only emerge during actual deployment.

Who is most affected by flawed AI benchmarking?

Organizations implementing AI solutions, policymakers creating AI regulations, investors funding AI development, and end-users who depend on AI systems in critical applications like healthcare, education, and business operations are all affected by inaccurate benchmarking.

What industries were studied in the research mentioned?

The research examined AI deployment in small businesses, healthcare, humanitarian organizations, nonprofits, higher education institutions across the UK, United States, and Asia, plus leading AI design ecosystems in London and Silicon Valley.

How do flawed benchmarks impact AI development?

Flawed benchmarks encourage optimization for artificial test conditions rather than real-world performance, potentially leading to AI systems that perform well in labs but fail or create unexpected problems in actual organizational settings.

Status: Partially Verified
Confidence: 75%
Source: MIT Technology Review

Source Scoring

Overall: 78 / 100
Decision: Normal

Detailed Metrics

Reliability 80/100
Importance 85/100
Corroboration 70/100
Scope Clarity 75/100
Volatility Risk (Low is better) 30/100

Key Claims Verified

Claim: AI is almost never used in the way it is benchmarked; it's used in messy, complex environments, interacting with multiple people over time.
Verdict: Confirmed — widely acknowledged in AI ethics and human-computer interaction literature; supported by studies on "in-the-wild" AI evaluation.

Claim: Current benchmarking misalignment leads to misunderstanding AI capabilities, overlooking systemic risks, and misjudging economic and social consequences.
Verdict: Confirmed — multiple academic papers and reports (e.g., from Stanford HAI, Partnership on AI) discuss the limitations of static benchmarks and their societal impact gaps.

Claim: The author proposes "HAIC benchmarks" (Human–AI, Context-Specific Evaluation) as a needed alternative.
Verdict: Partial — the specific term "HAIC benchmarks" appears unique to this article, but the core concept of evaluating AI within human teams and workflows is an active research area (e.g., collaborative AI, human-AI teaming benchmarks).

Claim: The author has studied real-world AI deployment since 2022 in various sectors and regions (UK, US, Asia, London, Silicon Valley).
Verdict: Unclear — the author's specific research portfolio and affiliations are not detailed in the provided text; the claim is plausible but requires external sourcing for full verification.

Supporting Evidence

  • Stanford Institute for Human-Centered AI (HAI), "The AI Index Report 2024" (confidence: High)
  • Papers with Code, "Beyond the Imitation Game" (BIG-bench) and dynamic evaluation (confidence: High)
  • ACM Conference on Human Factors in Computing Systems (CHI), proceedings on human-AI interaction (confidence: Medium)
  • MIT Technology Review, publisher reputation (confidence: Medium)

Caveats / Notes

  • The article is an opinion/analysis piece, not a news report of a specific event. The proposed 'HAIC' framework is a conceptual contribution; its novelty and adoption are not yet established. The future publication date (2026) suggests this is an advance or speculative publication.

Source

technologyreview.com
