#AI Benchmarking

Latest news articles tagged with "AI Benchmarking". Follow the timeline of events, related topics, and entities.

Articles (30)

🇺🇸 RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark — 23/03/2026 [USA]
arXiv:2509.24897v2 Announce Type: replace Abstract: The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. Ho...
Related: #Model Unification
🇺🇸 ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models — 23/03/2026 [USA]
arXiv:2603.19515v1 Announce Type: new Abstract: Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluati...
Related: #Cognitive Planning
🇺🇸 FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment — 23/03/2026 [USA]
arXiv:2603.19539v1 Announce Type: cross Abstract: We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, u...
Related: #Drug Regulation
🇺🇸 URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models — 23/03/2026 [USA]
arXiv:2603.19281v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for enhancing LLMs in scenarios that demand extensive factual knowledge...
Related: #Uncertainty Quantification
🇺🇸 GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams — 23/03/2026 [USA]
arXiv:2603.19252v1 Announce Type: cross Abstract: Evaluating the symbolic reasoning of large language models (LLMs) calls for geometry benchmarks that require multi-step proofs grounded in both text ...
Related: #Geometric Reasoning
🇺🇸 MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model — 20/03/2026 [USA]
arXiv:2603.18892v1 Announce Type: cross Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical en...
Related: #Spatial Reasoning
🇺🇸 FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering — 20/03/2026 [USA]
arXiv:2603.18329v1 Announce Type: new Abstract: Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior...
Related: #Inference-Time Steering
🇺🇸 Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning — 20/03/2026 [USA]
arXiv:2603.18662v1 Announce Type: new Abstract: Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem cond...
Related: #Geometric Reasoning
🇺🇸 WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models — 19/03/2026 [USA]
arXiv:2603.17680v1 Announce Type: cross Abstract: Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are prim...
Related: #Weather Recognition
🇺🇸 When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents — 19/03/2026 [USA]
arXiv:2603.17104v1 Announce Type: cross Abstract: Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is pr...
Related: #Code Generation
🇺🇸 Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models — 19/03/2026 [USA]
arXiv:2603.16944v1 Announce Type: cross Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This pa...
Related: #Image Editing
🇺🇸 AIDABench: AI Data Analytics Benchmark — 18/03/2026 [USA]
arXiv:2603.15636v1 Announce Type: new Abstract: As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation sta...
Related: #Data Analytics
🇺🇸 SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models — 18/03/2026 [USA]
arXiv:2603.16859v1 Announce Type: new Abstract: Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM bench...
Related: #Multimodal AI
🇺🇸 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method — 18/03/2026 [USA]
arXiv:2603.16179v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perc...
Related: #Computer Vision
🇺🇸 VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents — 18/03/2026 [USA]
arXiv:2603.16289v1 Announce Type: cross Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in th...
Related: #Multimodal Agents
🇺🇸 BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs — 18/03/2026 [USA]
arXiv:2603.16557v1 Announce Type: new Abstract: Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third...
Related: #Personalized AI
🇺🇸 V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models — 18/03/2026 [USA]
arXiv:2603.16581v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are...
Related: #Temporal Knowledge
🇺🇸 NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing — 18/03/2026 [USA]
arXiv:2603.16307v1 Announce Type: new Abstract: Remote sensing underpins crucial applications such as disaster relief and ecological field surveys, where systems must understand complex scenes and co...
Related: #Remote Sensing
🇺🇸 CUBE: A Standard for Unifying Agent Benchmarks — 18/03/2026 [USA]
arXiv:2603.15798v1 Announce Type: new Abstract: The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial ...
Related: #Agent Evaluation
🇺🇸 SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding — 18/03/2026 [USA]
arXiv:2603.16124v1 Announce Type: cross Abstract: Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. ...
Related: #Code Understanding
🇺🇸 Benchmarking Zero-Shot Reasoning Approaches for Error Detection in Solidity Smart Contracts — 17/03/2026 [USA]
arXiv:2603.13239v1 Announce Type: new Abstract: Smart contracts play a central role in blockchain systems by encoding financial and operational logic. Still, their susceptibility to subtle security f...
Related: #Blockchain Security
🇺🇸 ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation — 17/03/2026 [USA]
arXiv:2603.13251v1 Announce Type: new Abstract: Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We intr...
Related: #Code Generation
🇺🇸 ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation — 16/03/2026 [USA]
arXiv:2603.13154v1 Announce Type: cross Abstract: As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal require...
Related: #ESG Reporting
🇺🇸 NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation — 16/03/2026 [USA]
arXiv:2510.17914v2 Announce Type: replace-cross Abstract: We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of...
Related: #Earth Observation
🇺🇸 OSCBench: Benchmarking Object State Change in Text-to-Video Generation — 13/03/2026 [USA]
arXiv:2603.11698v1 Announce Type: cross Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing b...
Related: #Video Generation
🇺🇸 FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles — 13/03/2026 [USA]
arXiv:2603.11339v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit ...
Related: #Financial Regulation
🇺🇸 TopoBench: Benchmarking LLMs on Hard Topological Reasoning — 13/03/2026 [USA]
arXiv:2603.12133v1 Announce Type: new Abstract: Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains ...
Related: #Topological Reasoning
🇺🇸 SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning — 13/03/2026 [USA]
arXiv:2603.12249v1 Announce Type: cross Abstract: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness...
Related: #Scientific Documents
🇺🇸 BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs — 13/03/2026 [USA]
arXiv:2603.11991v1 Announce Type: cross Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable la...
Related: #Text Classification
🇺🇸 Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights — 12/03/2026 [USA]
arXiv:2603.10033v1 Announce Type: cross Abstract: Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tas...
Related: #Graph Foundation Models

About the topic: AI Benchmarking

The topic "AI Benchmarking" aggregates 30+ news articles from various countries.