# AI Evaluation
Latest news articles tagged with "AI Evaluation". Follow the timeline of events, related topics, and entities.
Articles (13)
🇺🇸 Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
[USA]
arXiv:2602.22585v1 Announce Type: new Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic erro...
Related: #Human-Rated Data, #Psychometric Modeling, #Systematic Error Correction
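The abstract's premise is that human ratings carry rater-specific bias that simple averaging ignores. As a minimal illustration of the item-response-theory idea, the sketch below uses a Rasch-style model with an explicit rater-severity term (an assumption for illustration, not the paper's actual model): the probability that a rater approves an output is a logistic function of the output's latent quality minus that rater's severity, and fitting both jointly yields quality estimates adjusted for which raters happened to label each item.

```python
# A minimal Rasch-style sketch with a rater-severity term (illustrative only):
# P(rater approves item) = sigmoid(quality[item] - severity[rater]),
# fit by gradient ascent on the Bernoulli log-likelihood.
import numpy as np

def fit_rater_model(items, raters, ratings, n_items, n_raters, lr=0.1, steps=2000):
    """items, raters, ratings are parallel arrays; ratings are 0/1 approvals."""
    items, raters = np.asarray(items), np.asarray(raters)
    ratings = np.asarray(ratings, dtype=float)
    quality, severity = np.zeros(n_items), np.zeros(n_raters)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(quality[items] - severity[raters])))
        resid = ratings - p                                   # d(log-likelihood)/d(logit)
        grad_q = np.bincount(items, weights=resid, minlength=n_items)
        grad_s = np.bincount(raters, weights=-resid, minlength=n_raters)
        quality += lr * grad_q / np.maximum(np.bincount(items, minlength=n_items), 1)
        severity += lr * grad_s / np.maximum(np.bincount(raters, minlength=n_raters), 1)
        severity -= severity.mean()                           # pin the scale so the model is identifiable
    return quality, severity
```

Unlike a raw per-item mean rating, the fitted quality values remain comparable even when harsh and lenient raters are assigned to items unevenly.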
🇺🇸 A Benchmark for Deep Information Synthesis
[USA]
arXiv:2602.21143v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data...
Related: #Information Synthesis, #Benchmark Development
🇺🇸 Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
[USA]
arXiv:2602.20379v1 Announce Type: cross Abstract: Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, w...
Related: #Enterprise Systems, #Retrieval-Augmented Generation
🇺🇸 The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
[USA]
arXiv:2602.17831v1 Announce Type: new Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highl...
Related: #Language Model Reasoning, #Machine Learning Benchmarking
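This entry frames evaluation as head-to-head puzzle duels between models. One standard way to turn duel outcomes into a ranking, used here purely as an illustration rather than as the paper's scoring rule, is an Elo-style update in which each result shifts the two competitors' ratings by an amount proportional to how surprising the outcome was.

```python
# Illustrative Elo-style aggregation of pairwise "duel" outcomes between models.
# This is a generic rating scheme, not necessarily the scoring used in the paper.
from collections import defaultdict

def update_elo(ratings, model_a, model_b, score_a, k=32.0):
    """score_a is 1.0 if model_a wins the duel, 0.0 if it loses, 0.5 for a draw."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

ratings = defaultdict(lambda: 1000.0)          # every model starts at the same rating
duels = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]  # hypothetical results
for a, b, score in duels:
    update_elo(ratings, a, b, score)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The K-factor controls how quickly ratings move; with many duels per model it can be lowered for a more stable ranking.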
🇺🇸 VeRA: Verified Reasoning Data Augmentation at Scale
[USA]
arXiv:2602.13217v1 Announce Type: new Abstract: The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format ...
Related: #Benchmark Robustness, #Data Augmentation, #Experimental Design, #Reproducibility
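The abstract's complaint is that static benchmarks reuse the same items and so reward memorization. A minimal sketch of the alternative, assuming a templated generator paired with a programmatic verifier (hypothetical template and checker, not the paper's augmentation pipeline), regenerates a fresh instance per seed so no fixed test item recurs and every answer can be checked mechanically.

```python
# Illustrative template-plus-verifier generation (hypothetical template, not the paper's pipeline):
# a fresh instance is produced per seed, and the generator itself supplies the ground truth.
import random
import re

def make_instance(seed):
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
    return f"Compute ({a} + {b}) * {c}.", (a + b) * c

def verify(model_output, answer):
    """Accept a response only if the last integer it contains equals the generated answer."""
    numbers = re.findall(r"-?\d+", model_output)
    return bool(numbers) and int(numbers[-1]) == answer

question, gold = make_instance(seed=12345)      # a new seed yields a new, never-reused item
print(question, verify(f"The result is {gold}.", gold))
```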
🇺🇸 Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
[USA]
arXiv:2602.12665v1 Announce Type: new Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wo...
Related: #Logical Reasoning, #Benchmark Development
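The point of parameterization is to vary structural difficulty while holding surface form fixed. A small sketch along those lines, assuming random 3-SAT with a controlled clause-to-variable ratio (a standard construction, not necessarily the paper's generator), makes the difficulty knob explicit and labels each instance by brute force.

```python
# Illustrative generator of random 3-SAT instances with a controlled clause-to-variable ratio,
# plus a brute-force satisfiability label; not the paper's construction.
import itertools
import random

def random_3sat(n_vars, ratio, seed=0):
    rng = random.Random(seed)
    n_clauses = round(ratio * n_vars)          # structural difficulty knob, independent of wording
    clauses = []
    for _ in range(n_clauses):
        vars_ = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in vars_])
    return clauses

def is_satisfiable(clauses, n_vars):
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

clauses = random_3sat(n_vars=10, ratio=4.3, seed=1)   # a ratio near 4.3 sits in the hard region for random 3-SAT
print(is_satisfiable(clauses, n_vars=10))
```

Holding the ratio fixed while scaling the number of variables separates structural hardness from sheer prompt length.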
🇺🇸 RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
[USA]
arXiv:2602.12892v1 Announce Type: cross Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception a...
Related: #Machine Learning, #Multi-modal Models
🇺🇸 VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
[USA]
arXiv:2510.07978v3 Announce Type: replace Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. Howe...
Related: #Speech Technology, #Benchmark Development
🇺🇸 Soft Contamination Means Benchmarks Test Shallow Generalization
[USA]
arXiv:2602.12413v1 Announce Type: cross Abstract: If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalizat...
Related: #Model Generalization, #Data Contamination
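The abstract's claim is that benchmark items leaking into training data inflate apparent generalization. A coarse screen for such leakage, offered only as an illustration (verbatim n-gram overlap, which is weaker than the "soft" contamination the title refers to), flags test items whose long n-grams reappear in the training corpus.

```python
# Illustrative n-gram overlap screen for benchmark contamination; a coarse heuristic,
# not the paper's notion of "soft" contamination, which concerns subtler overlap than verbatim copies.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item, training_docs, n=8):
    """Fraction of the test item's n-grams that appear verbatim in any training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

# Hypothetical usage: flag test questions whose 8-gram overlap with training text exceeds a threshold.
print(overlap_fraction("What is the capital of France and when was it founded?", ["... corpus text ..."]))
```

Items above a small overlap fraction are candidates for exclusion or for separate reporting.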
🇺🇸 RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
[USA]
arXiv:2602.12424v1 Announce Type: cross Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objec...
Related: #Machine Learning, #Research Innovation
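One simple way to realize difficulty-weighted ranking, used here as an assumption rather than as RankLLM's estimator, is to score each question's difficulty by the fraction of models that miss it and then weight each model's accuracy by those difficulties, so that separating power comes mostly from the hard questions.

```python
# Illustrative difficulty-weighted scoring: a question answered correctly by few models
# counts for more. One simple weighting scheme, not necessarily the paper's.
import numpy as np

def weighted_scores(correct):
    """correct: array of shape (n_models, n_questions) with 0/1 entries."""
    correct = np.asarray(correct, dtype=float)
    difficulty = 1.0 - correct.mean(axis=0)          # fraction of models that miss each question
    weights = difficulty / max(difficulty.sum(), 1e-9)
    return correct @ weights                          # per-model difficulty-weighted accuracy

scores = weighted_scores([[1, 1, 0, 1],               # hypothetical results for three models
                          [1, 0, 0, 1],
                          [1, 1, 1, 1]])
print(scores)  # questions missed by more models dominate the ranking
```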
🇺🇸 SCOPE: Selective Conformal Optimized Pairwise LLM Judging
[USA]
arXiv:2602.13110v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practica...
Related: #Statistical Guarantees, #LLM Optimization
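A minimal split-conformal sketch of selective pairwise judging, under the assumptions that the judge exposes a probability for "A wins" and that a held-out set of human-labeled comparisons is available (the paper's actual procedure may differ): calibrate a nonconformity threshold on the held-out set, then abstain and defer to a human whenever the resulting prediction set does not single out one winner.

```python
# Minimal split-conformal sketch for selective pairwise judging (illustrative; the paper's
# procedure may differ). prob_a is the LLM judge's probability that response A wins.
import numpy as np

def calibrate(cal_probs_a, cal_labels, alpha=0.1):
    """Fit a conformal threshold on held-out comparisons with human labels (1 = A wins, 0 = B wins)."""
    cal_probs_a = np.asarray(cal_probs_a, dtype=float)
    p_true = np.where(np.asarray(cal_labels) == 1, cal_probs_a, 1.0 - cal_probs_a)
    scores = 1.0 - p_true                              # nonconformity: low probability on the true label
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, level))

def judge(prob_a, qhat):
    """Return 'A', 'B', or 'abstain' (defer to a human) from the conformal prediction set."""
    pred_set = [label for label, p in (("A", prob_a), ("B", 1.0 - prob_a)) if 1.0 - p <= qhat]
    return pred_set[0] if len(pred_set) == 1 else "abstain"

# Hypothetical judge confidences and human labels for the calibration split.
qhat = calibrate([0.9, 0.2, 0.7, 0.3, 0.4], [1, 0, 1, 1, 0], alpha=0.1)
print(judge(0.85, qhat), judge(0.55, qhat))            # a confident verdict and a deferred one
```

The target error rate alpha trades coverage of automated verdicts against how often comparisons are escalated to humans.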
🇺🇸 LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
[USA]
arXiv:2602.06533v1 Announce Type: new Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical...
Related: #Artificial Intelligence, #Formal Logic
🇺🇸 Measuring the performance of our models on real-world tasks
[USA]
OpenAI introduces GDPval, a new evaluation that measures model performance on real-world economically valuable tasks across 44 occupations.
Related: #Economic Impact, #Technology Assessment, #Performance Measurement