# AI Evaluation
Latest news articles tagged with "AI Evaluation". Follow the timeline of events, related topics, and entities.
Articles (13)
🇺🇸 Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
[USA]
arXiv:2602.22585v1 Announce Type: new Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic erro...
Related: #Human-Rated Data, #Psychometric Modeling, #Systematic Error Correction
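The abstract's premise is that human ratings carry rater-specific bias that simple averaging ignores. As a minimal illustration of the item-response-theory idea, the sketch below uses a Rasch-style model with an explicit rater-severity term (an assumption for illustration, not the paper's actual model): the probability that a rater approves an output is a logistic function of the output's latent quality minus that rater's severity, and fitting both jointly yields quality estimates adjusted for which raters happened to label each item.

```python
# A minimal Rasch-style sketch with a rater-severity term (illustrative only):
# P(rater approves item) = sigmoid(quality[item] - severity[rater]),
# fit by gradient ascent on the Bernoulli log-likelihood.
import numpy as np

def fit_rater_model(items, raters, ratings, n_items, n_raters, lr=0.1, steps=2000):
    """items, raters, ratings are parallel arrays; ratings are 0/1 approvals."""
    items, raters = np.asarray(items), np.asarray(raters)
    ratings = np.asarray(ratings, dtype=float)
    quality, severity = np.zeros(n_items), np.zeros(n_raters)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(quality[items] - severity[raters])))
        resid = ratings - p                                   # d(log-likelihood)/d(logit)
        grad_q = np.bincount(items, weights=resid, minlength=n_items)
        grad_s = np.bincount(raters, weights=-resid, minlength=n_raters)
        quality += lr * grad_q / np.maximum(np.bincount(items, minlength=n_items), 1)
        severity += lr * grad_s / np.maximum(np.bincount(raters, minlength=n_raters), 1)
        severity -= severity.mean()                           # pin the scale so the model is identifiable
    return quality, severity
```

Unlike a raw per-item mean rating, the fitted quality values remain comparable even when harsh and lenient raters are assigned to items unevenly.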
🇺🇸 A Benchmark for Deep Information Synthesis
[USA]
arXiv:2602.21143v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data...
Related: #Information Synthesis, #Benchmark Development
🇺🇸 Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
[USA]
arXiv:2602.20379v1 Announce Type: cross Abstract: Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, w...
Related: #Enterprise Systems, #Retrieval-Augmented Generation
🇺🇸 The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
[USA]
arXiv:2602.17831v1 Announce Type: new Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highl...
Related: #Language Model Reasoning, #Machine Learning Benchmarking
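This entry frames evaluation as head-to-head puzzle duels between models. One standard way to turn duel outcomes into a ranking, used here purely as an illustration rather than as the paper's scoring rule, is an Elo-style update in which each result shifts the two competitors' ratings by an amount proportional to how surprising the outcome was.

```python
# Illustrative Elo-style aggregation of pairwise "duel" outcomes between models.
# This is a generic rating scheme, not necessarily the scoring used in the paper.
from collections import defaultdict

def update_elo(ratings, model_a, model_b, score_a, k=32.0):
    """score_a is 1.0 if model_a wins the duel, 0.0 if it loses, 0.5 for a draw."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

ratings = defaultdict(lambda: 1000.0)          # every model starts at the same rating
duels = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]  # hypothetical results
for a, b, score in duels:
    update_elo(ratings, a, b, score)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The K-factor controls how quickly ratings move; with many duels per model it can be lowered for a more stable ranking.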
🇺🇸 VeRA: Verified Reasoning Data Augmentation at Scale
[USA]
arXiv:2602.13217v1 Announce Type: new Abstract: The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format ...
Related: #Benchmark Robustness, #Data Augmentation, #Experimental Design, #Reproducibility
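The abstract's complaint is that static benchmarks reuse the same items and so reward memorization. A minimal sketch of the alternative, assuming a templated generator paired with a programmatic verifier (hypothetical template and checker, not the paper's augmentation pipeline), regenerates a fresh instance per seed so no fixed test item recurs and every answer can be checked mechanically.

```python
# Illustrative template-plus-verifier generation (hypothetical template, not the paper's pipeline):
# a fresh instance is produced per seed, and the generator itself supplies the ground truth.
import random
import re

def make_instance(seed):
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 99), rng.randint(2, 99), rng.randint(2, 9)
    return f"Compute ({a} + {b}) * {c}.", (a + b) * c

def verify(model_output, answer):
    """Accept a response only if the last integer it contains equals the generated answer."""
    numbers = re.findall(r"-?\d+", model_output)
    return bool(numbers) and int(numbers[-1]) == answer

question, gold = make_instance(seed=12345)      # a new seed yields a new, never-reused item
print(question, verify(f"The result is {gold}.", gold))
```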
🇺🇸 Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
[USA]
arXiv:2602.12665v1 Announce Type: new Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wo...
Related: #Logical Reasoning, #Benchmark Development
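The point of parameterization is to vary structural difficulty while holding surface form fixed. A small sketch along those lines, assuming random 3-SAT with a controlled clause-to-variable ratio (a standard construction, not necessarily the paper's generator), makes the difficulty knob explicit and labels each instance by brute force.

```python
# Illustrative generator of random 3-SAT instances with a controlled clause-to-variable ratio,
# plus a brute-force satisfiability label; not the paper's construction.
import itertools
import random

def random_3sat(n_vars, ratio, seed=0):
    rng = random.Random(seed)
    n_clauses = round(ratio * n_vars)          # structural difficulty knob, independent of wording
    clauses = []
    for _ in range(n_clauses):
        vars_ = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in vars_])
    return clauses

def is_satisfiable(clauses, n_vars):
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

clauses = random_3sat(n_vars=10, ratio=4.3, seed=1)   # a ratio near 4.3 sits in the hard region for random 3-SAT
print(is_satisfiable(clauses, n_vars=10))
```

Holding the ratio fixed while scaling the number of variables separates structural hardness from sheer prompt length.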
🇺🇸 RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
[USA]
arXiv:2602.12892v1 Announce Type: cross Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception a...
Related: #Machine Learning, #Multi-modal Models
🇺🇸 VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
[USA]
arXiv:2510.07978v3 Announce Type: replace Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. Howe...
Related: #Speech Technology, #Benchmark Development
🇺🇸 Soft Contamination Means Benchmarks Test Shallow Generalization
[USA]
arXiv:2602.12413v1 Announce Type: cross Abstract: If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalizat...
Related: #Model Generalization, #Data Contamination
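The abstract's claim is that benchmark items leaking into training data inflate apparent generalization. A coarse screen for such leakage, offered only as an illustration (verbatim n-gram overlap, which is weaker than the "soft" contamination the title refers to), flags test items whose long n-grams reappear in the training corpus.

```python
# Illustrative n-gram overlap screen for benchmark contamination; a coarse heuristic,
# not the paper's notion of "soft" contamination, which concerns subtler overlap than verbatim copies.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item, training_docs, n=8):
    """Fraction of the test item's n-grams that appear verbatim in any training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

# Hypothetical usage: flag test questions whose 8-gram overlap with training text exceeds a threshold.
print(overlap_fraction("What is the capital of France and when was it founded?", ["... corpus text ..."]))
```

Items above a small overlap fraction are candidates for exclusion or for separate reporting.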
🇺🇸 RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
[USA]
arXiv:2602.12424v1 Announce Type: cross Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objec...
Related: #Machine Learning, #Research Innovation
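One simple way to realize difficulty-weighted ranking, used here as an assumption rather than as RankLLM's estimator, is to score each question's difficulty by the fraction of models that miss it and then weight each model's accuracy by those difficulties, so that separating power comes mostly from the hard questions.

```python
# Illustrative difficulty-weighted scoring: a question answered correctly by few models
# counts for more. One simple weighting scheme, not necessarily the paper's.
import numpy as np

def weighted_scores(correct):
    """correct: array of shape (n_models, n_questions) with 0/1 entries."""
    correct = np.asarray(correct, dtype=float)
    difficulty = 1.0 - correct.mean(axis=0)          # fraction of models that miss each question
    weights = difficulty / max(difficulty.sum(), 1e-9)
    return correct @ weights                          # per-model difficulty-weighted accuracy

scores = weighted_scores([[1, 1, 0, 1],               # hypothetical results for three models
                          [1, 0, 0, 1],
                          [1, 1, 1, 1]])
print(scores)  # questions missed by more models dominate the ranking
```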
🇺🇸 SCOPE: Selective Conformal Optimized Pairwise LLM Judging
[USA]
arXiv:2602.13110v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practica...
Related: #Statistical Guarantees, #LLM Optimization
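A minimal split-conformal sketch of selective pairwise judging, under the assumptions that the judge exposes a probability for "A wins" and that a held-out set of human-labeled comparisons is available (the paper's actual procedure may differ): calibrate a nonconformity threshold on the held-out set, then abstain and defer to a human whenever the resulting prediction set does not single out one winner.

```python
# Minimal split-conformal sketch for selective pairwise judging (illustrative; the paper's
# procedure may differ). prob_a is the LLM judge's probability that response A wins.
import numpy as np

def calibrate(cal_probs_a, cal_labels, alpha=0.1):
    """Fit a conformal threshold on held-out comparisons with human labels (1 = A wins, 0 = B wins)."""
    cal_probs_a = np.asarray(cal_probs_a, dtype=float)
    p_true = np.where(np.asarray(cal_labels) == 1, cal_probs_a, 1.0 - cal_probs_a)
    scores = 1.0 - p_true                              # nonconformity: low probability on the true label
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, level))

def judge(prob_a, qhat):
    """Return 'A', 'B', or 'abstain' (defer to a human) from the conformal prediction set."""
    pred_set = [label for label, p in (("A", prob_a), ("B", 1.0 - prob_a)) if 1.0 - p <= qhat]
    return pred_set[0] if len(pred_set) == 1 else "abstain"

# Hypothetical judge confidences and human labels for the calibration split.
qhat = calibrate([0.9, 0.2, 0.7, 0.3, 0.4], [1, 0, 1, 1, 0], alpha=0.1)
print(judge(0.85, qhat), judge(0.55, qhat))            # a confident verdict and a deferred one
```

The target error rate alpha trades coverage of automated verdicts against how often comparisons are escalated to humans.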
🇺🇸 LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
[USA]
arXiv:2602.06533v1 Announce Type: new Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical...
Related: #Artificial Intelligence, #Formal Logic
🇺🇸 Measuring the performance of our models on real-world tasks
[USA]
OpenAI introduces GDPval, a new evaluation that measures model performance on real-world economically valuable tasks across 44 occupations.
Related: #Economic Impact, #Technology Assessment, #Performance Measurement