# Benchmarking
Latest news articles tagged with "Benchmarking". Follow the timeline of events, related topics, and entities.
Articles (30)
- 🇺🇸 ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
[USA]
arXiv:2604.06264v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biolo...
Related: #Artificial Intelligence, #Scientific Research
- 🇺🇸 CoverageBench: Evaluating Information Coverage across Tasks and Domains
[USA]
arXiv:2603.20034v1 Announce Type: cross Abstract: We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is cov...
Related: #AI Evaluation
- 🇺🇸 CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation
[USA]
arXiv:2603.19274v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) demonstrate considerable potential in clinical diagnostics, a domain that inherently requires synthesizing c...
Related: #Clinical AI
- 🇺🇸 WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
[USA]
arXiv:2603.17357v1 Announce Type: cross Abstract: Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted ...
Related: #Privacy, #AI Agents
- 🇺🇸 Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions
[USA]
arXiv:2603.17522v1 Announce Type: cross Abstract: The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. ...
Related: #AI Detection
- 🇺🇸 Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
[USA]
arXiv:2603.17631v1 Announce Type: cross Abstract: The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different R...
Related: #Reinforcement Learning
- 🇺🇸 RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
[USA]
arXiv:2510.23571v2 Announce Type: replace-cross Abstract: The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evalu...
Related: #Robotics, #Simulation
- 🇺🇸 HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
[USA]
arXiv:2603.11975v1 Announce Type: cross Abstract: The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured indu...
Related: #AI Safety
- 🇺🇸 CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
[USA]
arXiv:2603.11863v1 Announce Type: new Abstract: The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifa...
Related: #AI Creativity
- 🇺🇸 RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
[USA]
arXiv:2603.11337v1 Announce Type: new Abstract: LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulne...
Related: #AI Safety
- 🇺🇸 DeliberationBench: A Normative Benchmark for the Influence of Large Language Models on Users' Views
[USA]
arXiv:2603.10018v1 Announce Type: cross Abstract: As large language models (LLMs) become pervasive as assistants and thought partners, it is important to characterize their persuasive influence on us...
Related: #AI Ethics
- 🇺🇸 GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments
[USA]
arXiv:2603.10858v1 Announce Type: cross Abstract: Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons ...
Related: #Robotics, #Simulation
- 🇺🇸 MASEval: Extending Multi-Agent Evaluation from Models to Systems
[USA]
arXiv:2603.08835v1 Announce Type: new Abstract: The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). ...
Related: #AI evaluation, #Multi-agent systems
- 🇺🇸 AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
[USA]
arXiv:2603.09435v1 Announce Type: new Abstract: The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and f...
Related: #AI Regulation
- 🇺🇸 SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
[USA]
arXiv:2603.09853v1 Announce Type: cross Abstract: Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as...
Related: #Audio AI
- 🇺🇸 CktEvo: Repository-Level RTL Code Benchmark for Design Evolution
[USA]
arXiv:2603.08718v1 Announce Type: cross Abstract: Register-Transfer Level (RTL) coding is an iterative, repository-scale process in which Power, Performance, and Area (PPA) emerge from interactions a...
Related: #Hardware Design
- 🇺🇸 SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
[USA]
arXiv:2603.09320v1 Announce Type: cross Abstract: Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative...
Related: #Space Technology, #Computer Vision
- 🇺🇸 TML-Bench: Benchmark for Data Science Agents on Tabular ML Tasks
[USA]
arXiv:2603.05764v1 Announce Type: cross Abstract: Autonomous coding agents can produce strong tabular baselines quickly on Kaggle-style tasks. Practical value depends on end-to-end correctness and re...
Related: #Data Science
- 🇺🇸 Interactive Benchmarks
[USA]
arXiv:2603.04737v1 Announce Type: new Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model's abil...
Related: #Technology
- 🇺🇸 Pipeline for Verifying LLM-Generated Mathematical Solutions
[USA]
arXiv:2602.20770v1 Announce Type: new Abstract: With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilit...
Related: #Artificial Intelligence, #Mathematical Verification
- 🇺🇸 Tool Building as a Path to "Superintelligence"
[USA]
arXiv:2602.21061v1 Announce Type: new Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma...
Related: #Artificial Intelligence, #Superintelligence, #Logical Reasoning
- 🇺🇸 CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
[USA]
arXiv:2602.20571v1 Announce Type: new Abstract: Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect ...
Related: #Artificial Intelligence, #Causal Inference, #Research Evaluation
- 🇺🇸 PreScience: A Benchmark for Forecasting Scientific Contributions
[USA]
arXiv:2602.20459v1 Announce Type: new Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help ...
Related: #Artificial Intelligence, #Scientific Research, #Forecasting
- 🇺🇸 Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
[USA]
arXiv:2602.17001v1 Announce Type: new Abstract: Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries fro...
Related: #Time Series Databases, #Natural Language Querying, #Neuro-Symbolic AI, #Search-Then-Verify Pipeline
- 🇺🇸 COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
[USA]
arXiv:2509.05249v2 Announce Type: replace-cross Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in...
Related: #Artificial Intelligence, #Machine Learning, #Compositionality, #Generalization
- 🇺🇸 StarEmbed: Benchmarking Time Series Foundation Models on Astronomical Observations of Variable Stars
[USA]
arXiv:2510.06200v3 Announce Type: replace-cross Abstract: Time series foundation models (TSFMs) are increasingly being adopted as highly-capable general-purpose time series representation learners. A...
Related: #Time series foundation models, #Astronomical data, #Variable stars, #Irregular sampling
- 🇺🇸 Decision Making under Imperfect Recall: Algorithms and Benchmarks
[USA]
arXiv:2602.15252v1 Announce Type: cross Abstract: In game theory, imperfect-recall decision problems model situations in which an agent forgets information it held before. They encompass games such a...
Related: #Game theory, #Imperfect recall, #Algorithm evaluation, #AI privacy
- 🇺🇸 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
[USA]
arXiv:2602.12670v1 Announce Type: new Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard wa...
Related: #AI evaluation, #Agent capabilities
- 🇺🇸 LIBERO-X: Robustness Litmus for Vision-Language-Action Models
[USA]
arXiv:2602.06556v1 Announce Type: cross Abstract: Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of...
Related: #Artificial Intelligence, #Robotics
- 🇺🇸 OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
[USA]
arXiv:2601.19924v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and eva...
Related: #Technology, #Optimization
Key Entities (11)
- Large language model (3 articles)
- NLP (1 article)
- RAG (1 article)
- Artificial Intelligence Act (1 article)
- CURE (1 article)
- Reasoning model (1 article)
- Benchmarking (1 article)
- GRACE (1 article)
- Benchmark (1 article)
- Superintelligence (1 article)
- RTL (1 article)
About the topic: Benchmarking
The topic "Benchmarking" aggregates 30 news articles.