#Model Evaluation
Latest news articles tagged with "Model Evaluation". Follow the timeline of events, related topics, and entities.
Articles (5)
- VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
[USA]
arXiv:2602.21054v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation ...
Related: #Artificial Intelligence, #Computer Vision, #AI Safety
- Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation
[USA]
arXiv:2602.17529v1 Announce Type: new Abstract: Large language models (LLMs) have shown strong potential across a variety of tasks, but their application in the telecom field remains challenging due ...
Related: #Artificial Intelligence, #Natural Language Processing, #Telecom Engineering, #Knowledge Graphs
- Language-Guided Invariance Probing of Vision-Language Models
[USA]
arXiv:2511.13494v1 Announce Type: cross Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliab...
Related: #Artificial Intelligence, #Natural Language Processing, #Computer Vision
- RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?
[USA]
arXiv:2602.07096v1 Announce Type: cross Abstract: Reliable financial reasoning requires knowing not only how to answer, but also when an answer cannot be justified. In real financial practice, proble...
Related: #Artificial Intelligence, #FinTech
- Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
[USA]
arXiv:2602.05523v1 Announce Type: cross Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing poi...
Related: #Artificial Intelligence, #Cybersecurity