Language-Guided Invariance Probing of Vision-Language Models
#Vision-Language Models #Language-Guided Invariance Probing #LGIP benchmark #CLIP #Zero-shot learning #Linguistic perturbations #Semantic robustness #MS COCO dataset
📌 Key Takeaways
- Researchers developed the LGIP benchmark to evaluate how vision-language models respond to linguistic perturbations
- The benchmark tests both invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips (see the sketch after this list)
- LGIP uses 40k MS COCO images, each paired with five linguistic variations, for comprehensive testing
- Current VLMs show strong zero-shot performance but haven't been thoroughly tested for linguistic robustness
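To make the probing idea concrete, here is a minimal sketch of scoring one image against an original caption, a meaning-preserving paraphrase, and a meaning-changing flip with an off-the-shelf CLIP checkpoint. This is not the paper's implementation: the checkpoint, the image path, the example captions, and the gap/margin readouts are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint and image path, chosen for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("coco_example.jpg")
captions = [
    "A man riding a horse on the beach.",      # original caption
    "A person on horseback along the shore.",  # meaning-preserving paraphrase
    "A man riding a bicycle on the beach.",    # meaning-changing semantic flip
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image: a (1 image, 3 captions) matrix of scaled similarities
    scores = model(**inputs).logits_per_image.squeeze(0)

# An invariant model keeps the paraphrase score close to the original;
# a sensitive model pushes the flipped caption clearly below it.
paraphrase_gap = (scores[0] - scores[1]).abs().item()
flip_margin = (scores[0] - scores[2]).item()
print(f"paraphrase gap: {paraphrase_gap:.2f}  flip margin: {flip_margin:.2f}")
```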
📖 Full Retelling
On November 23, 2025, researchers announced Language-Guided Invariance Probing (LGIP), a new benchmark for evaluating how vision-language models respond to linguistic perturbations. In image-text matching, LGIP measures two complementary properties: invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips, using 40k MS COCO images that each come with five linguistic variations.

The benchmark arrives at a moment when vision-language models such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP demonstrate strong zero-shot performance yet remain largely untested for reliability under controlled linguistic changes. Recognizing this gap in evaluation methodology, the research team designed LGIP as a standardized way to assess how such models handle different formulations of the same semantic content. By pairing meaning-preserving paraphrases with meaning-changing semantic flips, the benchmark checks two things at once: whether a model's matching scores stay consistent when the wording changes but the meaning does not, and whether the model detects when the meaning itself has been altered.
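At the dataset level, per-image scores like those in the sketch above have to be aggregated into invariance and sensitivity summaries. The paper's exact metric definitions are not given in this summary, so the sketch below uses two plausible stand-ins, mean paraphrase drift and flip detection rate; the function name and both definitions are assumptions for illustration.

```python
import numpy as np

def summarize_probe_scores(orig, paras, flips):
    """Aggregate per-image matching scores into two summary numbers.

    orig:  (N,)   score of each image with its original caption
    paras: (N, P) scores with P meaning-preserving paraphrases
    flips: (N, F) scores with F meaning-changing semantic flips
    """
    # Invariance: average absolute drift under paraphrasing (lower is better).
    invariance_gap = np.abs(paras - orig[:, None]).mean()
    # Sensitivity: how often a flipped caption scores below the original
    # caption for the same image (higher is better).
    flip_detection_rate = (flips < orig[:, None]).mean()
    return invariance_gap, flip_detection_rate

# Example with synthetic scores, for illustration only:
rng = np.random.default_rng(0)
orig = rng.normal(25.0, 2.0, size=100)
paras = orig[:, None] + rng.normal(0.0, 0.5, size=(100, 4))
flips = orig[:, None] - rng.normal(2.0, 1.0, size=(100, 5))
print(summarize_probe_scores(orig, paras, flips))
```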
🏷️ Themes
Artificial Intelligence, Natural Language Processing, Computer Vision, Model Evaluation
📚 Related People & Topics
CLIP: Multimodal learning (2 shared connections)
Original Source
arXiv:2511.13494v1 Announce Type: cross
Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five […]
Read full article at source