Language-Guided Invariance Probing of Vision-Language Models


#Vision-Language Models #Language-Guided Invariance Probing #LGIP benchmark #CLIP #Zero-shot learning #Linguistic perturbations #Semantic robustness #MS COCO dataset

📌 Key Takeaways

  • Researchers developed LGIP benchmark to evaluate vision-language models' response to linguistic perturbations
  • The benchmark tests both invariance to paraphrases and sensitivity to semantic flips
  • LGIP uses 40k MS COCO images with five linguistic variations for comprehensive testing
  • Current VLMs show strong zero-shot performance but haven't been thoroughly tested for linguistic robustness

📖 Full Retelling

On November 23, 2025, researchers announced Language-Guided Invariance Probing (LGIP), a benchmark for evaluating how vision-language models (VLMs) respond to controlled linguistic perturbations. LGIP measures two complementary properties of image-text matching: invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips.

The benchmark arrives at a moment when VLMs such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP demonstrate strong zero-shot performance but have not been systematically tested for reliability under controlled linguistic changes. LGIP addresses that evaluation gap with a standardized protocol: built on 40,000 MS COCO images, each paired with five linguistic variations, it lets researchers check whether a model's matching scores stay stable when wording changes but meaning does not, and whether those scores shift when the meaning itself is altered.
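The two measurements described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's implementation: the real benchmark scores image-text pairs with a VLM such as CLIP, whereas here a toy bag-of-words cosine similarity stands in for the model (and the image is proxied by its own caption) so the metric logic runs without any model weights. All function names are illustrative.

```python
# Toy sketch of LGIP-style per-image scoring (assumed structure, not the
# authors' code). A real run would replace `embed` with VLM image/text
# encoders and compare image embeddings against caption embeddings.
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an encoder: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)


def lgip_scores(image_caption, paraphrase, semantic_flip):
    """Return (invariance_gap, flip_sensitivity) for one image.

    invariance_gap:   |s(img, caption) - s(img, paraphrase)|; lower is
                      better (the model should ignore rewording).
    flip_sensitivity: s(img, caption) - s(img, flip); higher is better
                      (the model should penalize a changed meaning).
    """
    img = embed(image_caption)  # proxy: image represented by its caption
    s_orig = cosine(img, embed(image_caption))
    s_para = cosine(img, embed(paraphrase))
    s_flip = cosine(img, embed(semantic_flip))
    return abs(s_orig - s_para), s_orig - s_flip


gap, sens = lgip_scores(
    "a brown dog chasing a red ball",       # original caption
    "a brown dog runs after a red ball",    # meaning-preserving paraphrase
    "a brown cat chasing a red ball",       # meaning-changing semantic flip
)
print(gap, sens)
```

A robust model would show a gap near zero and a clearly positive sensitivity; the toy embedding, which only counts surface tokens, illustrates why lexical overlap alone is a poor judge of meaning.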

🏷️ Themes

Artificial Intelligence, Natural Language Processing, Computer Vision, Model Evaluation

Original Source
arXiv:2511.13494v1 (announce type: cross)

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with fiv…
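Per-image scores like those in the abstract would then be aggregated over the whole dataset. The sketch below is a plausible aggregation under assumed conventions (the paper's exact summary statistics are not given in this excerpt): a mean paraphrase gap, plus the fraction of images where the model ranks the true caption above its semantic flip. All names and data here are illustrative.

```python
# Hypothetical dataset-level aggregation for an LGIP-style evaluation.
# Each entry is a (invariance_gap, flip_sensitivity) pair for one image.
def aggregate(results):
    """Summarize per-image scores across the benchmark."""
    mean_gap = sum(g for g, _ in results) / len(results)
    # Fraction of images where the original caption outscores the flip.
    flip_detection_rate = sum(s > 0 for _, s in results) / len(results)
    return mean_gap, flip_detection_rate


toy_results = [(0.02, 0.15), (0.05, -0.01), (0.01, 0.20)]  # made-up scores
mean_gap, rate = aggregate(toy_results)
print(mean_gap, rate)
```

Reporting the detection rate alongside the mean gap separates the two failure modes the benchmark targets: over-sensitivity to harmless rewording versus blindness to genuine meaning changes.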

Source

arxiv.org
