#Benchmarking
Latest news articles tagged with "Benchmarking". Follow the timeline of events, related topics, and entities.
Articles (11)
- 🇺🇸 Tool Building as a Path to "Superintelligence"
[USA]
arXiv:2602.21061v1 Announce Type: new Abstract: The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma...
Related: #Artificial Intelligence, #Superintelligence, #Logical Reasoning
- 🇺🇸 CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
[USA]
arXiv:2602.20571v1 Announce Type: new Abstract: Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect ...
Related: #Artificial Intelligence, #Causal Inference, #Research Evaluation
- 🇺🇸 Pipeline for Verifying LLM-Generated Mathematical Solutions
[USA]
arXiv:2602.20770v1 Announce Type: new Abstract: With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilit...
Related: #Artificial Intelligence, #Mathematical Verification
- 🇺🇸 PreScience: A Benchmark for Forecasting Scientific Contributions
[USA]
arXiv:2602.20459v1 Announce Type: new Abstract: Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help ...
Related: #Artificial Intelligence, #Scientific Research, #Forecasting
- 🇺🇸 Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
[USA]
arXiv:2602.17001v1 Announce Type: new Abstract: Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and summaries fro...
Related: #Time Series Databases, #Natural Language Querying, #Neuro-Symbolic AI, #Search-Then-Verify Pipeline
- 🇺🇸 StarEmbed: Benchmarking Time Series Foundation Models on Astronomical Observations of Variable Stars
[USA]
arXiv:2510.06200v3 Announce Type: replace-cross Abstract: Time series foundation models (TSFMs) are increasingly being adopted as highly-capable general-purpose time series representation learners. A...
Related: #Time series foundation models, #Astronomical data, #Variable stars, #Irregular sampling
- 🇺🇸 COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
[USA]
arXiv:2509.05249v2 Announce Type: replace-cross Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in...
Related: #Artificial Intelligence, #Machine Learning, #Compositionality, #Generalization
- 🇺🇸 Decision Making under Imperfect Recall: Algorithms and Benchmarks
[USA]
arXiv:2602.15252v1 Announce Type: cross Abstract: In game theory, imperfect-recall decision problems model situations in which an agent forgets information it held before. They encompass games such a...
Related: #Game theory, #Imperfect recall, #Algorithm evaluation, #AI privacy
- 🇺🇸 SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
[USA]
arXiv:2602.12670v1 Announce Type: new Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard wa...
Related: #AI evaluation, #Agent capabilities
- 🇺🇸 LIBERO-X: Robustness Litmus for Vision-Language-Action Models
[USA]
arXiv:2602.06556v1 Announce Type: cross Abstract: Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of...
Related: #Artificial Intelligence, #Robotics
- 🇺🇸 OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
[USA]
arXiv:2601.19924v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and eva...
Related: #Technology, #Optimization