#Benchmark Development

Latest news articles tagged with "Benchmark Development". Follow the timeline of events, related topics, and entities.

Articles (16)

🇺🇸 ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts — 01/04/2026 [USA]
arXiv:2603.28902v1 Announce Type: new Abstract: Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rat...
Related: #Artificial Intelligence, #Data Visualization, #Comparative Analysis
🇺🇸 IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch — 12/03/2026 [USA]
arXiv:2512.00997v2 Announce Type: replace Abstract: Reliable autoformalization remains challenging even in the era of large language models (LLMs). The scarcity of high-quality training data is a maj...
Related: #Mathematical Reasoning
🇺🇸 TraderBench: How Robust Are AI Agents in Adversarial Capital Markets? — 03/03/2026 [USA]
arXiv:2603.00285v1 Announce Type: new Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making centr...
Related: #AI Evaluation, #Financial Technology
🇺🇸 FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation — 27/02/2026 [USA]
arXiv:2602.22273v1 Announce Type: new Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practica...
Related: #Artificial Intelligence, #Financial Technology, #Evaluation Methodology
🇺🇸 CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation — 25/02/2026 [USA]
arXiv:2602.20170v1 Announce Type: cross Abstract: Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in loca...
Related: #Artificial Intelligence Safety, #Cultural Adaptation
🇺🇸 How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective — 25/02/2026 [USA]
arXiv:2602.20687v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven ...
Related: #Artificial Intelligence, #Embodied Intelligence
🇺🇸 LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification — 25/02/2026 [USA]
arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct pr...
Related: #Artificial Intelligence, #Logical Reasoning, #Neuro-Symbolic Systems
🇺🇸 A Benchmark for Deep Information Synthesis — 25/02/2026 [USA]
arXiv:2602.21143v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data...
Related: #AI Evaluation, #Information Synthesis
🇺🇸 Defining and Evaluating Physical Safety for Large Language Models — 20/02/2026 [USA]
arXiv:2411.02317v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and...
Related: #Large Language Model Safety, #Robotic System Control, #Prompt Engineering, #Regulatory Compliance
🇺🇸 AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks — 20/02/2026 [USA]
arXiv:2602.16901v1 Announce Type: new Abstract: LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horiz...
Related: #Artificial Intelligence Security, #Large Language Model Agents, #Long-Horizon Attack Vectors, #Multi-Turn Interaction Vulnerabilities
🇺🇸 GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning — 19/02/2026 [USA]
arXiv:2507.03267v2 Announce Type: replace Abstract: Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex...
Related: #Graph Neural Networks, #Dynamic Graphs, #Text-Attributed Graphs, #Generative Modeling
🇺🇸 Evaluating Robustness of Reasoning Models on Parameterized Logical Problems — 16/02/2026 [USA]
arXiv:2602.12665v1 Announce Type: new Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wo...
Related: #AI Evaluation, #Logical Reasoning
🇺🇸 GISA: A Benchmark for General Information-Seeking Assistant — 16/02/2026 [USA]
arXiv:2602.08543v2 Announce Type: replace-cross Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gather...
Related: #Artificial Intelligence, #Information Retrieval
🇺🇸 RAT-Bench: A Comprehensive Benchmark for Text Anonymization — 16/02/2026 [USA]
arXiv:2602.12806v1 Announce Type: cross Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of i...
Related: #Privacy and Data Protection, #Large Language Models, #Re-identification Risk
🇺🇸 EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition — 16/02/2026 [USA]
arXiv:2602.12919v1 Announce Type: cross Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventio...
Related: #Computer Vision, #Event-Based Imaging
🇺🇸 VoiceAgentBench: Are Voice Assistants ready for agentic tasks? — 16/02/2026 [USA]
arXiv:2510.07978v3 Announce Type: replace Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. Howe...
Related: #AI Evaluation, #Speech Technology

Key Entities (11)

Large language model (4 news)
Benchmark (3 news)
Artificial intelligence (1 news)
Financial market (1 news)
AI agent (1 news)
Human Touch (1 news)
Data and information visualization (1 news)
Financial intelligence (1 news)
Visual place recognition (1 news)
Computer vision (1 news)
Logical reasoning (1 news)

About the topic: Benchmark Development

The topic "Benchmark Development" aggregates 16+ news articles from various countries.