#AI Benchmarking
Latest news articles tagged with "AI Benchmarking". Follow the timeline of events, related topics, and entities.
Articles (30)
-
🇺🇸 RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
[USA]
arXiv:2509.24897v2 Announce Type: replace Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. Ho...
Related: #Model Unification -
🇺🇸 ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
[USA]
arXiv:2603.19515v1 Announce Type: new Abstract: Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluati...
Related: #Cognitive Planning -
🇺🇸 FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment
[USA]
arXiv:2603.19539v1 Announce Type: cross Abstract: We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, u...
Related: #Drug Regulation -
🇺🇸 URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models
[USA]
arXiv:2603.19281v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge...
Related: #Uncertainty Quantification -
🇺🇸 GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams
[USA]
arXiv:2603.19252v1 Announce Type: cross Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text ...
Related: #Geometric Reasoning -
🇺🇸 MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
[USA]
arXiv:2603.18892v1 Announce Type: cross Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical en...
Related: #Spatial Reasoning -
🇺🇸 FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering
[USA]
arXiv:2603.18329v1 Announce Type: new Abstract: Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior...
Related: #Inference-Time Steering -
🇺🇸 Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning
[USA]
arXiv:2603.18662v1 Announce Type: new Abstract: Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem cond...
Related: #Geometric Reasoning -
🇺🇸 WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models
[USA]
arXiv:2603.17680v1 Announce Type: cross Abstract: Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are prim...
Related: #Weather Recognition -
🇺🇸 When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents
[USA]
arXiv:2603.17104v1 Announce Type: cross Abstract: Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is pr...
Related: #Code Generation -
🇺🇸 Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
[USA]
arXiv:2603.16944v1 Announce Type: cross Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This pa...
Related: #Image Editing -
🇺🇸 AIDABench: AI Data Analytics Benchmark
[USA]
arXiv:2603.15636v1 Announce Type: new Abstract: As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation sta...
Related: #Data Analytics -
🇺🇸 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
[USA]
arXiv:2603.16859v1 Announce Type: new Abstract: Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM bench...
Related: #Multimodal AI -
🇺🇸 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method
[USA]
arXiv:2603.16179v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perc...
Related: #Computer Vision -
🇺🇸 VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents
[USA]
arXiv:2603.16289v1 Announce Type: cross Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in th...
Related: #Multimodal Agents -
🇺🇸 BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs
[USA]
arXiv:2603.16557v1 Announce Type: new Abstract: Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third...
Related: #Personalized AI -
🇺🇸 V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models
[USA]
arXiv:2603.16581v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are...
Related: #Temporal Knowledge -
🇺🇸 NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing
[USA]
arXiv:2603.16307v1 Announce Type: new Abstract: Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and co...
Related: #Remote Sensing -
🇺🇸 CUBE: A Standard for Unifying Agent Benchmarks
[USA]
arXiv:2603.15798v1 Announce Type: new Abstract: The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial ...
Related: #Agent Evaluation -
🇺🇸 SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
[USA]
arXiv:2603.16124v1 Announce Type: cross Abstract: Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. ...
Related: #Code Understanding -
🇺🇸 Benchmarking Zero-Shot Reasoning Approaches for Error Detection in Solidity Smart Contracts
[USA]
arXiv:2603.13239v1 Announce Type: new Abstract: Smart contracts play a central role in blockchain systems by encoding financial and operational logic. Still, their susceptibility to subtle security f...
Related: #Blockchain Security -
🇺🇸 ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation
[USA]
arXiv:2603.13251v1 Announce Type: new Abstract: Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We intr...
Related: #Code Generation -
🇺🇸 ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
[USA]
arXiv:2603.13154v1 Announce Type: cross Abstract: As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal require...
Related: #ESG Reporting -
🇺🇸 NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation
[USA]
arXiv:2510.17914v2 Announce Type: replace-cross Abstract: We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of...
Related: #Earth Observation -
🇺🇸 OSCBench: Benchmarking Object State Change in Text-to-Video Generation
[USA]
arXiv:2603.11698v1 Announce Type: cross Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing b...
Related: #Video Generation -
🇺🇸 FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles
[USA]
arXiv:2603.11339v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit ...
Related: #Financial Regulation -
🇺🇸 TopoBench: Benchmarking LLMs on Hard Topological Reasoning
[USA]
arXiv:2603.12133v1 Announce Type: new Abstract: Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains ...
Related: #Topological Reasoning -
🇺🇸 SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
[USA]
arXiv:2603.12249v1 Announce Type: cross Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness...
Related: #Scientific Documents -
🇺🇸 BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
[USA]
arXiv:2603.11991v1 Announce Type: cross Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable la...
Related: #Text Classification -
🇺🇸 Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights
[USA]
arXiv:2603.10033v1 Announce Type: cross Abstract: Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tas...
Related: #Graph Foundation Models
Key Entities (9)
- Large language model (3 news)
- AI agent (2 news)
- Language model (1 news)
- Artificial intelligence (1 news)
- Food and Drug Administration (1 news)
- Earth observation (1 news)
- Solidity (1 news)
- Error detection and correction (1 news)
- Remote sensing (1 news)
About the topic: AI Benchmarking
The topic "AI Benchmarking" aggregates 30+ news articles from various countries.