#Benchmarking and Evaluation
Latest news articles tagged with "Benchmarking and Evaluation". Follow the timeline of events, related topics, and entities.
Articles (2)
- 🇺🇸 VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
[USA]
arXiv:2505.15801v4 Announce Type: replace-cross Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical comp...
Related: #Large Language Models, #Reinforcement Learning, #Reference‑Based Reward Systems, #AI Alignment
- 🇺🇸 EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research
[USA]
arXiv:2602.15034v1 Announce Type: cross Abstract: While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly...
Related: #AI for Social Science (AI4SS), #Large Language Models (LLMs), #Hierarchical Task Decomposition, #Educational Research Methodology