#Model Evaluation

Latest news articles tagged with "Model Evaluation". Follow the timeline of events, related topics, and entities.

Articles (5)

🇺🇸 VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation — 25/02/2026 [USA]
arXiv:2602.21054v1. Announce Type: cross. Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation ...
Related: #Artificial Intelligence, #Computer Vision, #AI Safety
🇺🇸 Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation — 20/02/2026 [USA]
arXiv:2602.17529v1. Announce Type: new. Abstract: Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due ...
Related: #Artificial Intelligence, #Natural Language Processing, #Telecom Engineering, #Knowledge Graphs
🇺🇸 Language-Guided Invariance Probing of Vision-Language Models — 16/02/2026 [USA]
arXiv:2511.13494v1. Announce Type: cross. Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliab...
Related: #Artificial Intelligence, #Natural Language Processing, #Computer Vision
🇺🇸 RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid? — 10/02/2026 [USA]
arXiv:2602.07096v1. Announce Type: cross. Abstract: Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, proble...
Related: #Artificial Intelligence, #FinTech
🇺🇸 Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations — 07/02/2026 [USA]
arXiv:2602.05523v1. Announce Type: cross. Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing poi...
Related: #Artificial Intelligence, #Cybersecurity