#Benchmark development

Latest news articles tagged with "Benchmark development".

Articles (4)

🇺🇸 SourceBench: Can AI Answers Reference Quality Web Sources? — 20/02/2026 [USA]
arXiv:2602.16942v1. Announce type: new. Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evide...
Related: #Artificial intelligence evaluation, #Web search and information retrieval, #Source quality assessment, #Human‑in‑the‑loop evaluation
🇺🇸 EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery — 19/02/2026 [USA]
arXiv:2602.15918v1. Announce type: cross. Abstract: Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance fo...
Related: #Spatial reasoning, #Multimodal Large Language Models, #Earth imagery, #Embodied AI
🇺🇸 OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety — 18/02/2026 [USA]
arXiv:2507.06134v2. Announce type: replace. Abstract: Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world...
Related: #Artificial‑intelligence safety, #Real‑world AI agent evaluation, #Tool abstraction in AI, #Methodological rigor
🇺🇸 ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models — 18/02/2026 [USA]
arXiv:2602.15758v1. Announce type: cross. Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data...
Related: #Multimodal artificial intelligence, #Data visualization, #Iterative human‑machine interaction, #Exploratory data analysis