# LLM Evaluation
Latest news articles tagged with "LLM Evaluation". Follow the timeline of events, related topics, and entities.
Articles (7)
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs [USA]
  arXiv:2509.09677v3 Announce Type: replace Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illu...
  Related: #AI Performance
- OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! [USA]
  arXiv:2509.26495v3 Announce Type: replace Abstract: Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussi...
  Related: #AI Safety
- LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation [USA]
  arXiv:2603.12522v1 Announce Type: cross Abstract: As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web a...
  Related: #AI Bias
- SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks [USA]
  arXiv:2603.10002v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreads...
  Related: #Spreadsheet Generation
- Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives [USA]
  arXiv:2603.09994v1 Announce Type: cross Abstract: Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional...
  Related: #Semantic Compositionality
- Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation [USA]
  arXiv:2603.05485v1 Announce Type: new Abstract: As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be util...
  Related: #AI Fairness
- IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models [USA]
  arXiv:2602.16467v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual comp...
  Related: #Multilingual Benchmarking, #Indian Educational Standards, #High-stakes Testing, #STEM and Humanities Assessment
Key Entities (2)
- Large language model (1 article)
- AI safety (1 article)
About the topic: LLM Evaluation
The topic "LLM Evaluation" aggregates 7 news articles, all arXiv preprints, currently tagged [USA].