#Benchmark Development
Latest news articles tagged with "Benchmark Development". Follow the timeline of events, related topics, and entities.
Articles (16)
- 🇺🇸 ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
[USA]
arXiv:2603.28902v1 Announce Type: new Abstract: Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rat...
Related: #Artificial Intelligence, #Data Visualization, #Comparative Analysis
- 🇺🇸 IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
[USA]
arXiv:2512.00997v2 Announce Type: replace Abstract: Reliable autoformalization remains challenging even in the era of large language models (LLMs). The scarcity of high-quality training data is a maj...
Related: #Mathematical Reasoning
- 🇺🇸 TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?
[USA]
arXiv:2603.00285v1 Announce Type: new Abstract: Evaluating AI agents in finance faces two key challenges: static benchmarks require costly expert annotation yet miss the dynamic decision-making centr...
Related: #AI Evaluation, #Financial Technology
- 🇺🇸 FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation
[USA]
arXiv:2602.22273v1 Announce Type: new Abstract: We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practica...
Related: #Artificial Intelligence, #Financial Technology, #Evaluation Methodology
- 🇺🇸 CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
[USA]
arXiv:2602.20170v1 Announce Type: cross Abstract: Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in loca...
Related: #Artificial Intelligence Safety, #Cultural Adaptation
- 🇺🇸 How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
[USA]
arXiv:2602.20687v1 Announce Type: new Abstract: Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven ...
Related: #Artificial Intelligence, #Embodied Intelligence
- 🇺🇸 LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
[USA]
arXiv:2602.21044v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct pr...
Related: #Artificial Intelligence, #Logical Reasoning, #Neuro-Symbolic Systems
- 🇺🇸 A Benchmark for Deep Information Synthesis
[USA]
arXiv:2602.21143v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data...
Related: #AI Evaluation, #Information Synthesis
- 🇺🇸 Defining and Evaluating Physical Safety for Large Language Models
[USA]
arXiv:2411.02317v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and...
Related: #Large Language Model Safety, #Robotic System Control, #Prompt Engineering, #Regulatory Compliance
- 🇺🇸 AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks
[USA]
arXiv:2602.16901v1 Announce Type: new Abstract: LLM agents are increasingly deployed in long-horizon, complex environments to solve challenging problems, but this expansion exposes them to long-horiz...
Related: #Artificial Intelligence Security, #Large Language Model Agents, #Long-Horizon Attack Vectors, #Multi-Turn Interaction Vulnerabilities
- 🇺🇸 GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
[USA]
arXiv:2507.03267v2 Announce Type: replace Abstract: Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex...
Related: #Graph Neural Networks, #Dynamic Graphs, #Text-Attributed Graphs, #Generative Modeling
- 🇺🇸 Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
[USA]
arXiv:2602.12665v1 Announce Type: new Abstract: Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wo...
Related: #AI Evaluation, #Logical Reasoning
- 🇺🇸 GISA: A Benchmark for General Information-Seeking Assistant
[USA]
arXiv:2602.08543v2 Announce Type: replace-cross Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gather...
Related: #Artificial Intelligence, #Information Retrieval
- 🇺🇸 RAT-Bench: A Comprehensive Benchmark for Text Anonymization
[USA]
arXiv:2602.12806v1 Announce Type: cross Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of i...
Related: #Privacy and Data Protection, #Large Language Models, #Re-identification Risk
- 🇺🇸 EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition
[USA]
arXiv:2602.12919v1 Announce Type: cross Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventio...
Related: #Computer Vision, #Event-Based Imaging
- 🇺🇸 VoiceAgentBench: Are Voice Assistants Ready for Agentic Tasks?
[USA]
arXiv:2510.07978v3 Announce Type: replace Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. Howe...
Related: #AI Evaluation, #Speech Technology
Key Entities (11)
- Large language model (4 news)
- Benchmark (3 news)
- Artificial intelligence (1 news)
- Financial market (1 news)
- AI agent (1 news)
- Human Touch (1 news)
- Data and information visualization (1 news)
- Financial intelligence (1 news)
- Visual place recognition (1 news)
- Computer vision (1 news)
- Logical reasoning (1 news)
About the topic: Benchmark Development
The topic "Benchmark Development" aggregates 16+ news articles from various countries.