# Model Interpretability
Latest news articles tagged with "Model Interpretability". Follow the timeline of events, related topics, and entities.
## Articles (7)
- 🇺🇸 MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering [USA]
  arXiv:2602.15915v1 Announce Type: cross Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. How...
  Related: #Visual Question Answering, #Knowledge Integration, #Noise Reduction, #Cross-modal Reasoning
- 🇺🇸 Batch-CAM: Introduction to better reasoning in convolutional deep learning models [USA]
  arXiv:2510.00664v2 Announce Type: replace Abstract: Deep learning opacity often impedes deployment in high-stakes domains. We propose a training framework that aligns model focus with class-represent...
  Related: #AI Transparency, #Deep Learning
  (A hedged class-activation-map sketch appears after the article list.)
- 🇺🇸 CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs [USA]
  arXiv:2602.07080v1 Announce Type: cross Abstract: Current paradigms for code verification rely heavily on external mechanisms (such as execution-based unit tests or auxiliary LLM judges), which are ofte...
  Related: #Artificial Intelligence, #Software Engineering
- 🇺🇸 Endogenous Resistance to Activation Steering in Language Models [USA]
  arXiv:2602.06941v1 Announce Type: cross Abstract: Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved respon...
  Related: #Artificial Intelligence, #AI Safety
  (A minimal activation-steering sketch appears after the article list.)
- 🇺🇸 On the Identifiability of Steering Vectors in Large Language Models [USA]
  arXiv:2602.06801v1 Announce Type: cross Abstract: Activation steering methods, such as persona vectors, are widely used to control large language model behavior and increasingly interpreted as reveal...
  Related: #Artificial Intelligence, #Technology
- 🇺🇸 DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders [USA]
  arXiv:2602.05859v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling rese...
  Related: #Artificial Intelligence, #Machine Learning
  (A minimal sparse-autoencoder sketch appears after the article list.)
- 🇺🇸 Evaluating chain-of-thought monitorability [USA]
  OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findings show that monitoring a model's internal reasoni...
  Related: #AI Safety, #Technical Innovation
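
For background on the Batch-CAM entry above: the title points at class activation mapping (CAM), but the snippet does not spell out the paper's training framework, so the following is only a minimal Grad-CAM-style sketch of how a class-conditional focus map is typically computed for a convolutional classifier. The ResNet-18 backbone, the hooked layer, and the input shape are illustrative assumptions, not details from the paper.

```python
# Hypothetical Grad-CAM-style sketch (NOT the paper's Batch-CAM objective):
# the heatmap highlights spatial locations that most increase a chosen class logit.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()  # placeholder backbone
activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["feat"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0].detach()

# Hook the last convolutional block (layer4 in ResNet-18).
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

def grad_cam(image, class_idx=None):
    """image: [1, 3, H, W] normalized tensor; returns an [H, W] heatmap in [0, 1]."""
    logits = model(image)
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()

    feat = activations["feat"]                      # [1, C, h, w] feature maps
    grad = gradients["feat"]                        # [1, C, h, w] their gradients
    weights = grad.mean(dim=(2, 3), keepdim=True)   # channel importance weights
    cam = F.relu((weights * feat).sum(dim=1, keepdim=True))   # [1, 1, h, w]
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```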
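
The two steering entries above (endogenous resistance and identifiability of steering vectors) both concern activation steering: adding a fixed direction to a model's hidden states at inference time. Neither teaser gives implementation details, so this is only a minimal sketch of the underlying mechanism on a small Hugging Face model; the layer index, scale, and the random placeholder vector are assumptions (in practice the direction would come from, e.g., contrastive prompt pairs or a persona dataset).

```python
# Minimal activation-steering sketch: add a fixed vector to one transformer
# block's hidden states during generation. Layer, scale, and vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model, not one used in the papers
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6                                   # which block to steer (assumption)
alpha = 4.0                                     # steering strength (assumption)
steering_vector = torch.randn(model.config.hidden_size)  # placeholder direction

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)

ids = tok("The assistant replied:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore unsteered behavior
```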
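
Finally, the DLM-Scope entry describes sparse autoencoders as the standard mechanistic-interpretability tool it extends to diffusion language models. As background, here is a minimal sketch of the usual SAE recipe: a one-hidden-layer autoencoder trained to reconstruct cached model activations under an L1 sparsity penalty, so that each input is explained by a small number of learned features. The widths, penalty coefficient, and the random batch in the usage line are illustrative assumptions, not values from the paper.

```python
# Minimal sparse-autoencoder (SAE) sketch over cached activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats     # reconstruction + features

d_model, d_hidden, l1_coeff = 768, 8 * 768, 1e-3   # assumed sizes and penalty
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

def train_step(acts):
    """acts: [batch, d_model] activations cached from the model under study."""
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with placeholder data; real training would feed residual-stream activations.
print(train_step(torch.randn(64, d_model)))
```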