#Benchmark Development
Latest news articles tagged with "Benchmark Development". Follow the timeline of events, related topics, and entities.
Articles (13)
- 🇺🇸 FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation
[USA]
arXiv:2602.22273v1 Announce Type: new Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practica...
Related: #Artificial Intelligence, #Financial Technology, #Evaluation Methodology
- 🇺🇸 LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
[USA]
arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct pr...
Related: #Artificial Intelligence, #Logical Reasoning, #Neuro-Symbolic Systems
- 🇺🇸 CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
[USA]
arXiv:2602.20170v1 Announce Type: cross Abstract: Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in loca...
Related: #Artificial Intelligence Safety, #Cultural Adaptation
- 🇺🇸 How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
[USA]
arXiv:2602.20687v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven ...
Related: #Artificial Intelligence, #Embodied Intelligence
- 🇺🇸 A Benchmark for Deep Information Synthesis
[USA]
arXiv:2602.21143v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data...
Related: #AI Evaluation, #Information Synthesis
- 🇺🇸 AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
[USA]
arXiv:2602.16901v1 Announce Type: new Abstract: LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horiz...
Related: #Artificial Intelligence Security, #Large Language Model Agents, #Long‑Horizon Attack Vectors, #Multi‑Turn Interaction Vulnerabilities
- 🇺🇸 Defining and Evaluating Physical Safety for Large Language Models
[USA]
arXiv:2411.02317v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and...
Related: #Large Language Model Safety, #Robotic System Control, #Prompt Engineering, #Regulatory Compliance
- 🇺🇸 GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
[USA]
arXiv:2507.03267v2 Announce Type: replace Abstract: Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex...
Related: #Graph Neural Networks, #Dynamic Graphs, #Text‑Attributed Graphs, #Generative Modeling
- 🇺🇸 EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition
[USA]
arXiv:2602.12919v1 Announce Type: cross Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventio...
Related: #Computer Vision, #Event-Based Imaging
- 🇺🇸 RAT-Bench: A Comprehensive Benchmark for Text Anonymization
[USA]
arXiv:2602.12806v1 Announce Type: cross Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of i...
Related: #Privacy and Data Protection, #Large Language Models, #Re‑identification Risk
- 🇺🇸 GISA: A Benchmark for General Information-Seeking Assistant
[USA]
arXiv:2602.08543v2 Announce Type: replace-cross Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gather...
Related: #Artificial Intelligence, #Information Retrieval
- 🇺🇸 VoiceAgentBench: Are Voice Assistants ready for agentic tasks?
[USA]
arXiv:2510.07978v3 Announce Type: replace Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. Howe...
Related: #AI Evaluation, #Speech Technology
- 🇺🇸 Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
[USA]
arXiv:2602.12665v1 Announce Type: new Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wo...
Related: #AI Evaluation, #Logical Reasoning