# LLM Evaluation
Latest news articles tagged with "LLM Evaluation". Follow the timeline of events, related topics, and entities.
Articles (7)
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs [USA]
  arXiv:2509.09677v3 Announce Type: replace Abstract: Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illu...
  Related: #AI Performance
- OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! [USA]
  arXiv:2509.26495v3 Announce Type: replace Abstract: Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussi...
  Related: #AI Safety
- LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation [USA]
  arXiv:2603.12522v1 Announce Type: cross Abstract: As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web a...
  Related: #AI Bias
- SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks [USA]
  arXiv:2603.10002v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreads...
  Related: #Spreadsheet Generation
- Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives [USA]
  arXiv:2603.09994v1 Announce Type: cross Abstract: Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional...
  Related: #Semantic Compositionality
- Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation [USA]
  arXiv:2603.05485v1 Announce Type: new Abstract: As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be util...
  Related: #AI Fairness
- IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models [USA]
  arXiv:2602.16467v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual comp...
  Related: #Multilingual Benchmarking, #Indian Educational Standards, #High-stakes Testing, #STEM and Humanities Assessment
Key Entities (2)
- Large language model (1 article)
- AI safety (1 article)
About the topic: LLM Evaluation
The topic "LLM Evaluation" aggregates 7 news articles, all arXiv preprints, currently tagged [USA].