# Benchmarking
Latest news articles tagged with "Benchmarking". Follow the timeline of events, related topics, and entities.
Articles (30)
- 🇺🇸 ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
[USA]
arXiv:2604.06264v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biolo...
Related: #Artificial Intelligence, #Scientific Research
- 🇺🇸 CoverageBench: Evaluating Information Coverage across Tasks and Domains
[USA]
arXiv:2603.20034v1 Announce Type: cross Abstract: We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is cov...
Related: #AI Evaluation
- 🇺🇸 CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
[USA]
arXiv:2603.19274v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing c...
Related: #Clinical AI
- 🇺🇸 WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
[USA]
arXiv:2603.17357v1 Announce Type: cross Abstract: Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted ...
Related: #Privacy, #AI Agents
- 🇺🇸 Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions
[USA]
arXiv:2603.17522v1 Announce Type: cross Abstract: The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. ...
Related: #AI Detection
- 🇺🇸 Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
[USA]
arXiv:2603.17631v1 Announce Type: cross Abstract: The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different R...
Related: #Reinforcement Learning
- 🇺🇸 RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
[USA]
arXiv:2510.23571v2 Announce Type: replace-cross Abstract: The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evalu...
Related: #Robotics, #Simulation
- 🇺🇸 HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
[USA]
arXiv:2603.11975v1 Announce Type: cross Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured indu...
Related: #AI Safety
- 🇺🇸 CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
[USA]
arXiv:2603.11863v1 Announce Type: new Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifa...
Related: #AI Creativity
- 🇺🇸 RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
[USA]
arXiv:2603.11337v1 Announce Type: new Abstract: LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulne...
Related: #AI Safety
- 🇺🇸 DeliberationBench: A Normative Benchmark for the Influence of Large Language Models on Users' Views
[USA]
arXiv:2603.10018v1 Announce Type: cross Abstract: As large language models (LLMs) become pervasive as assistants and thought partners, it is important to characterize their persuasive influence on us...
Related: #AI Ethics
- 🇺🇸 GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments
[USA]
arXiv:2603.10858v1 Announce Type: cross Abstract: Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons ...
Related: #Robotics, #Simulation
- 🇺🇸 MASEval: Extending Multi-Agent Evaluation from Models to Systems
[USA]
arXiv:2603.08835v1 Announce Type: new Abstract: The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). ...
Related: #AI evaluation, #Multi-agent systems
- 🇺🇸 AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
[USA]
arXiv:2603.09435v1 Announce Type: new Abstract: The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and f...
Related: #AI Regulation
- 🇺🇸 SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
[USA]
arXiv:2603.09853v1 Announce Type: cross Abstract: Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as...
Related: #Audio AI
- 🇺🇸 CktEvo: Repository-Level RTL Code Benchmark for Design Evolution
[USA]
arXiv:2603.08718v1 Announce Type: cross Abstract: Register-Transfer Level (RTL) coding is an iterative, repository-scale process in which Power, Performance, and Area (PPA) emerge from interactions a...
Related: #Hardware Design
- 🇺🇸 SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
[USA]
arXiv:2603.09320v1 Announce Type: cross Abstract: Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative...
Related: #Space Technology, #Computer Vision
- 🇺🇸 TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
[USA]
arXiv:2603.05764v1 Announce Type: cross Abstract: Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and re...
Related: #Data Science
- 🇺🇸 Interactive Benchmarks
[USA]
arXiv:2603.04737v1 Announce Type: new Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's abil...
Related: #Technology
- 🇺🇸 Pipeline for Verifying LLM-Generated Mathematical Solutions
[USA]
arXiv:2602.20770v1 Announce Type: new Abstract: With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilit...
Related: #Artificial Intelligence, #Mathematical Verification
- 🇺🇸 Tool Building as a Path to "Superintelligence"
[USA]
arXiv:2602.21061v1 Announce Type: new Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma...
Related: #Artificial Intelligence, #Superintelligence, #Logical Reasoning
- 🇺🇸 CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
[USA]
arXiv:2602.20571v1 Announce Type: new Abstract: Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect ...
Related: #Artificial Intelligence, #Causal Inference, #Research Evaluation
- 🇺🇸 PreScience: A Benchmark for Forecasting Scientific Contributions
[USA]
arXiv:2602.20459v1 Announce Type: new Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help ...
Related: #Artificial Intelligence, #Scientific Research, #Forecasting
- 🇺🇸 Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
[USA]
arXiv:2602.17001v1 Announce Type: new Abstract: Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries fro...
Related: #Time Series Databases, #Natural Language Querying, #Neuro-Symbolic AI, #Search-Then-Verify Pipeline
- 🇺🇸 COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
[USA]
arXiv:2509.05249v2 Announce Type: replace-cross Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in...
Related: #Artificial Intelligence, #Machine Learning, #Compositionality, #Generalization
- 🇺🇸 StarEmbed: Benchmarking Time Series Foundation Models on Astronomical Observations of Variable Stars
[USA]
arXiv:2510.06200v3 Announce Type: replace-cross Abstract: Time series foundation models (TSFMs) are increasingly being adopted as highly-capable general-purpose time series representation learners. A...
Related: #Time series foundation models, #Astronomical data, #Variable stars, #Irregular sampling
- 🇺🇸 Decision Making under Imperfect Recall: Algorithms and Benchmarks
[USA]
arXiv:2602.15252v1 Announce Type: cross Abstract: In game theory, imperfect-recall decision problems model situations in which an agent forgets information it held before. They encompass games such a...
Related: #Game theory, #Imperfect recall, #Algorithm evaluation, #AI privacy
- 🇺🇸 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
[USA]
arXiv:2602.12670v1 Announce Type: new Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard wa...
Related: #AI evaluation, #Agent capabilities
- 🇺🇸 LIBERO-X: Robustness Litmus for Vision-Language-Action Models
[USA]
arXiv:2602.06556v1 Announce Type: cross Abstract: Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of...
Related: #Artificial Intelligence, #Robotics
- 🇺🇸 OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
[USA]
arXiv:2601.19924v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and eva...
Related: #Technology, #Optimization
Key Entities (11)
- Large language model (3 articles)
- NLP (1 article)
- RAG (1 article)
- Artificial Intelligence Act (1 article)
- CURE (1 article)
- Reasoning model (1 article)
- Benchmarking (1 article)
- GRACE (1 article)
- Benchmark (1 article)
- Superintelligence (1 article)
- RTL (1 article)
About the topic: Benchmarking
The topic "Benchmarking" aggregates 30 news articles.