#Evaluation methodology
Latest news articles tagged with "Evaluation methodology".
Articles (3)
- 🇺🇸 Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents
[USA]
arXiv:2602.16943v1. Abstract: Large language models deployed as agents increasingly interact with external systems through tool calls--actions with real-world consequences that text...
Related: #Artificial intelligence safety, #Model alignment, #Regulated domain compliance, #Prompt engineering
- 🇺🇸 Simple Baselines are Competitive with Code Evolution
[USA]
arXiv:2602.16805v1. Abstract: Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existi...
Related: #Search‑space and domain‑knowledge design, #Baseline versus advanced technique comparison, #Research practices in code evolution, #Variance and dataset size effects
- 🇺🇸 DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing
[USA]
arXiv:2602.13318v1. Abstract: Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection,...
Related: #Benchmarking of AI systems, #Multi‑agent workflow design, #Academic content creation, #Natural language processing