#AI evaluation

Latest news articles tagged with "AI evaluation". Follow the timeline of events, related topics, and entities.

Articles (5)

🇺🇸 A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines — 27/02/2026 [USA]
arXiv:2602.22442v1 (announce type: new). Abstract: Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation...
Related: #AutoML systems, #Decision quality assessment
🇺🇸 Implicit Intelligence -- Evaluating Agents on What Users Don't Say — 25/02/2026 [USA]
arXiv:2602.20424v1 (announce type: new). Abstract: Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that s...
Related: #Contextual reasoning, #Human-computer interaction
🇺🇸 A Theoretical Framework for Adaptive Utility-Weighted Benchmarking — 16/02/2026 [USA]
arXiv:2602.12356v1 (announce type: new). Abstract: Benchmarking has long served as a foundational practice in machine learning and, increasingly, in modern AI systems such as large language models, wher...
Related: #Benchmarking methodologies, #Machine learning progress
🇺🇸 From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems — 16/02/2026 [USA]
arXiv:2512.18080v2 (announce type: replace-cross). Abstract: Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant...
Related: #Software development, #Human-centered design
🇺🇸 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks — 16/02/2026 [USA]
arXiv:2602.12670v1 (announce type: new). Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard wa...
Related: #Benchmarking, #Agent capabilities