#AI evaluation
Latest news articles tagged with "AI evaluation". Follow the timeline of events, related topics, and entities.
Articles (5)
- 🇺🇸 A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines
[USA]
arXiv:2602.22442v1 Announce Type: new Abstract: Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation...
Related: #AutoML systems, #Decision quality assessment
- 🇺🇸 Implicit Intelligence -- Evaluating Agents on What Users Don't Say
[USA]
arXiv:2602.20424v1 Announce Type: new Abstract: Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that s...
Related: #Contextual reasoning, #Human-computer interaction
- 🇺🇸 A Theoretical Framework for Adaptive Utility-Weighted Benchmarking
[USA]
arXiv:2602.12356v1 Announce Type: new Abstract: Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, wher...
Related: #Benchmarking methodologies, #Machine learning progress
- 🇺🇸 From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
[USA]
arXiv:2512.18080v2 Announce Type: replace-cross Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant...
Related: #Software development, #Human-centered design
- 🇺🇸 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
[USA]
arXiv:2602.12670v1 Announce Type: new Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard wa...
Related: #Benchmarking, #Agent capabilities