How Do Inpainting Artifacts Propagate to Language?
#inpainting artifacts #vision-language models #diffusion models #image reconstruction #language generation #multimodal systems #computer vision
📌 Key Takeaways
- Researchers studied how visual artifacts from inpainting affect language generation in vision-language models
- A two-stage diagnostic setup enabled controlled comparisons between original and reconstructed image captions
- Consistent associations were found between reconstruction metrics (pixel-level and perceptual) and caption quality (lexical and semantic)
- Inpainting artifacts cause systematic, layer-dependent changes in model behavior
- The research provides a diagnostic framework for improving multimodal systems
📖 Full Retelling
Researchers Pratham Yashwante, Davit Abrahamyan, Shresth Grover, and Sukruth Rao submitted a study to arXiv on February 24, 2026, investigating how visual artifacts from diffusion-based inpainting affect language generation in vision-language models. The goal is to understand how visual reconstruction quality relates to downstream caption quality in multimodal systems. The researchers used a two-stage diagnostic setup: masked image regions were first reconstructed with diffusion-based inpainting, and the reconstructed images were then passed to captioning models. This allows controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, the team analyzed the relationship between reconstruction fidelity and caption quality and observed consistent associations between pixel-level and perceptual reconstruction metrics on one side and lexical and semantic captioning performance on the other. Additional analysis of intermediate visual representations and attention patterns showed that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. The authors present these results as a practical diagnostic framework for examining how visual reconstruction quality influences language generation.
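To make the comparison concrete, here is a minimal Python sketch of the kind of analysis described: score each reconstruction against its original image, score the caption from the reconstruction against the caption from the original, and correlate the two. It is an illustration under stated assumptions, not the authors' pipeline. The abstract does not name specific metrics, so PSNR stands in for a pixel-level metric and a unigram-overlap F1 stands in for a lexical caption metric. The pixel values and captions are hand-written toy data, and no inpainting or captioning model is run.

```python
import math
from collections import Counter


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def token_f1(reference, candidate):
    """Unigram-overlap F1 between two captions (a crude lexical stand-in)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def run_demo():
    """Correlate reconstruction fidelity with caption agreement on toy data."""
    original_pixels = [10, 20, 30, 40]          # hypothetical flattened image
    original_caption = "a dog on a beach"       # hypothetical caption of the original
    # (reconstructed pixels, caption of the reconstruction): all hypothetical
    reconstructions = [
        ([11, 19, 30, 41], "a dog on a beach"),
        ([20, 30, 20, 50], "a dog on sand"),
        ([60, 80, 0, 100], "a blurry photo"),
    ]
    fidelity = [psnr(original_pixels, pixels) for pixels, _ in reconstructions]
    agreement = [token_f1(original_caption, cap) for _, cap in reconstructions]
    return pearson(fidelity, agreement)


if __name__ == "__main__":
    print(f"Pearson r between PSNR and caption F1: {run_demo():.3f}")
```

In a real study the toy lists would be replaced by per-image metric values over a dataset, and a rank correlation may be preferable when the metrics are not linearly related.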
🏷️ Themes
Computer Vision, Artificial Intelligence, Multimodal Systems, Image Reconstruction
Original Source
arXiv:2602.20520 [cs.CV], Computer Science > Computer Vision and Pattern Recognition, submitted on 24 Feb 2026
Title: How Do Inpainting Artifacts Propagate to Language?
Authors: Pratham Yashwante, Davit Abrahamyan, Shresth Grover, Sukruth Rao
Abstract: We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.20520 [cs.CV] (or arXiv:2602.20520v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.20520 (arXiv-issued DOI via DataCite, pending registration)
Submission history: [v1] Tue, 24 Feb 2026 03:46:33 UTC (6,946 KB), from Pratham Yashwante