MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
#Knowledge‑Based Visual Question Answering #MaS‑VQA #Mask‑and‑Select Framework #External Knowledge Retrieval #Internal Model Knowledge #Reasoning Effectiveness #Answer Accuracy #NOI
📌 Key Takeaways
- KB‑VQA requires combining visual inputs with external knowledge to answer questions.
- Retrieved knowledge tends to be noisy, partially irrelevant, or misaligned with the visual content.
- Internal model knowledge is difficult to control and interpret, reducing transparency.
- Naive aggregation of external and internal knowledge limits reasoning effectiveness and accuracy.
- MaS‑VQA introduces a Mask‑and‑Select strategy to reduce noise and improve interpretability.
📖 Full Retelling
The authors present MaS-VQA, a Mask‑and‑Select framework for Knowledge‑Based Visual Question Answering (KB‑VQA). The work appears in a research preprint posted to arXiv in February 2026 and aims to refine how visual data is integrated with external knowledge sources. The motivation is to mitigate two common problems: retrieved knowledge that is noisy, irrelevant, or misaligned with the image, and internal model knowledge that is opaque and hard to control. Together these issues limit reasoning effectiveness and degrade answer accuracy.
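To make the mask-and-select idea concrete, here is a minimal sketch of how such a filter over retrieved knowledge might look. Everything below is an illustrative assumption: the function names, the token-overlap relevance heuristic, and the threshold/top-k parameters are not from the paper, whose actual mechanism is not detailed in this summary.

```python
# Hypothetical mask-and-select filter over retrieved knowledge snippets.
# The scoring heuristic (token overlap with question + image caption) is
# an illustrative stand-in, not the paper's method.

def relevance(snippet: str, context: str) -> float:
    """Fraction of context tokens that also appear in the snippet."""
    ctx = set(context.lower().split())
    snip = set(snippet.lower().split())
    return len(ctx & snip) / max(len(ctx), 1)

def mask_and_select(snippets, question, caption, threshold=0.2, top_k=2):
    context = f"{question} {caption}"
    scored = [(relevance(s, context), s) for s in snippets]
    # Mask: drop snippets scoring below the relevance threshold.
    kept = [(r, s) for r, s in scored if r >= threshold]
    # Select: keep only the top-k highest-scoring survivors.
    kept.sort(key=lambda rs: rs[0], reverse=True)
    return [s for _, s in kept[:top_k]]

snippets = [
    "the eiffel tower is in paris france",
    "bananas are yellow fruit",
    "paris is the capital of france",
]
answer_context = mask_and_select(
    snippets,
    question="what city is the eiffel tower in",
    caption="a photo of the eiffel tower",
)
```

In this toy run the irrelevant "bananas" snippet is masked out, and the two Paris-related snippets are selected for downstream aggregation, mirroring the summary's claim that filtering before aggregation reduces noise compared to naively concatenating all retrieved knowledge.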
🏷️ Themes
Visual Question Answering, Knowledge Integration, Noise Reduction, Model Interpretability, Cross‑modal Reasoning
Original Source
arXiv:2602.15915v1 Announce Type: cross
Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-