Reasoning-Driven Multimodal LLM for Domain Generalization




Original Source
Computer Science > Artificial Intelligence
arXiv:2602.23777 [Submitted on 27 Feb 2026]
Title: Reasoning-Driven Multimodal LLM for Domain Generalization
Authors: Zhipeng Xu, Zilong Wang, Xinyang Jiang, Dongsheng Li, De Cheng, Nannan Wang

Abstract: This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories, to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; ii) mismatches in reasoning patterns between the supervision signals and the fine-tuned MLLM lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: i) Multi-Task Cross-Training (MTCT), which introduces an additional direct classification pathway to guide reasoning supervision; and ii) Self-Aligned Reasoning Regularization, which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, Office...
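To make the two components concrete, below is a minimal PyTorch sketch of the ideas as the abstract describes them, not the authors' implementation. The toy model TinyLM, the balancing weight alpha, the token layouts, and the self-labeling schedule are all hypothetical stand-ins; the paper's actual architecture and hyperparameters are not given in this excerpt.

# Minimal sketch of MTCT and iterative self-labeling, assuming a causal
# LM interface (token ids in, next-token logits out). TinyLM is a toy
# stand-in so the sketch runs; it is NOT the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000  # hypothetical vocabulary size

class TinyLM(nn.Module):
    """Toy stand-in for an MLLM decoder: embeds tokens, predicts the next."""
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                  # ids: (batch, seq)
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)                  # (batch, seq, vocab)

def next_token_loss(model, ids):
    """Causal LM loss: predict ids[:, t+1] from the prefix ids[:, :t+1]."""
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def mtct_loss(model, chain_ids, label_ids, alpha=0.5):
    """Multi-Task Cross-Training, as the abstract describes it: a direct
    classification pathway (label tokens only, easy to optimize) is trained
    alongside the reasoning pathway (the full chain ending in the label,
    harder to optimize). alpha is a hypothetical balancing weight."""
    return alpha * next_token_loss(model, chain_ids) + \
           (1 - alpha) * next_token_loss(model, label_ids)

@torch.no_grad()
def self_label(model, prompt_ids, max_new=32):
    """Iterative self-labeling (the mechanism named for Self-Aligned
    Reasoning Regularization): regenerate a chain in the model's own
    reasoning pattern. Periodically swapping these chains in as the new
    supervision targets is an assumed schedule, not stated in the excerpt."""
    ids = prompt_ids
    for _ in range(max_new):
        next_id = model(ids)[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# Toy usage: random token tensors stand in for tokenized inputs.
model = TinyLM()
chains = torch.randint(0, VOCAB, (4, 48))   # prompt + reasoning chain + label
labels = torch.randint(0, VOCAB, (4, 8))    # prompt + label tokens only
mtct_loss(model, chains, labels).backward()

The point of the sketch is the shape of the objective: the easy direct-label term keeps gradients informative while the reasoning term carries the semantic richness, and self-labeling narrows the gap between the supervision's reasoning pattern and the model's own.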

Source

arxiv.org
