# AI Interpretability
Latest news articles tagged with "AI Interpretability". Follow the timeline of events, related topics, and entities.
Articles (15)
- 🇺🇸 ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations [USA]
  arXiv:2604.07019v1 (Announce Type: cross). Abstract: Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision-making processes. De...
  Related: #Machine Learning, #Research Tool
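  The snippet is cut off, but "concept saliency and selectivity" in a title like this is usually operationalized with probes over hidden activations. A minimal, hypothetical sketch of such a selectivity score (the function, toy data, and scikit-learn setup are illustrative assumptions, not the paper's method):

  ```python
  # Hypothetical sketch: score how selectively a concept is encoded in one
  # layer's activations via a cross-validated linear probe. Illustrative only.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def concept_selectivity(activations: np.ndarray, concept_labels: np.ndarray) -> float:
      """Probe accuracy above chance, as a crude selectivity score."""
      probe = LogisticRegression(max_iter=1000)
      acc = cross_val_score(probe, activations, concept_labels, cv=5).mean()
      chance = max(concept_labels.mean(), 1 - concept_labels.mean())
      return float(acc - chance)

  # Toy data just to show the call signature: a "concept" carried by unit 0.
  rng = np.random.default_rng(0)
  acts = rng.normal(size=(200, 64))          # (n_samples, hidden_dim)
  labels = (acts[:, 0] > 0).astype(int)      # binary concept labels
  print(concept_selectivity(acts, labels))   # well above 0 for this toy case
  ```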
- 🇺🇸 WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior [USA]
  arXiv:2603.18474v1 (Announce Type: cross). Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training ...
  Related: #Neural Networks
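  WASD itself is not described in the snippet; the generic ingredient this line of work rests on is intervening on a candidate neuron set and measuring the effect on behavior. A rough sketch of that ingredient (the model name, layer index, and neuron IDs below are placeholder assumptions, not findings from the paper):

  ```python
  # Illustrative only: zero-ablate a candidate set of MLP neurons with a forward
  # hook and compare the next-token distribution before and after.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")            # any causal LM works
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  layer_idx, neuron_ids = 5, [10, 42, 300]               # hypothetical "critical" neurons

  def ablate(module, inputs, output):
      output[..., neuron_ids] = 0.0                       # zero the chosen hidden units
      return output

  prompt = tok("The capital of France is", return_tensors="pt")
  with torch.no_grad():
      base = model(**prompt).logits[0, -1].softmax(-1)
      handle = model.transformer.h[layer_idx].mlp.register_forward_hook(ablate)
      ablated = model(**prompt).logits[0, -1].softmax(-1)
      handle.remove()

  kl = torch.sum(base * (base.log() - ablated.log())).item()
  print(f"KL(base || ablated) = {kl:.4f}")                # large shift = neurons matter
  ```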
- 🇺🇸 Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models [USA]
  arXiv:2603.18523v1 (Announce Type: cross). Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual ...
  Related: #Visual Reasoning
- 🇺🇸 Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations [USA]
  arXiv:2603.18353v1 (Announce Type: new). Abstract: Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpret...
  Related: #Model Errors
- 🇺🇸 DreamReader: An Interpretability Toolkit for Text-to-Image Models [USA]
  arXiv:2603.13299v1 (Announce Type: cross). Abstract: Despite the rapid adoption of text-to-image (T2I) diffusion models, causal and representation-level analysis remains fragmented and largely limited t...
  Related: #Text-to-Image
- 🇺🇸 Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients [USA]
  arXiv:2603.14665v1 (Announce Type: new). Abstract: Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is f...
  Related: #Model Behavior
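  One way to picture "sparse decomposition of training gradients" is factoring per-example gradient vectors into a small dictionary of shared atoms, rather than attributing behavior to whole documents. The toy sketch below illustrates that reading on a tiny linear model; it is not the paper's algorithm, and the dimensions are arbitrary.

  ```python
  # Toy illustration: collect per-example gradients and factor them into a
  # small dictionary of sparse "atoms" with scikit-learn. Illustrative only.
  import numpy as np
  import torch
  from sklearn.decomposition import DictionaryLearning

  torch.manual_seed(0)
  model = torch.nn.Linear(16, 1)
  loss_fn = torch.nn.MSELoss()

  X, y = torch.randn(64, 16), torch.randn(64, 1)
  grads = []
  for xi, yi in zip(X, y):                     # one gradient vector per example
      model.zero_grad()
      loss_fn(model(xi), yi).backward()
      grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]).numpy())
  G = np.stack(grads)                          # (n_examples, n_params) = (64, 17)

  # Each example's gradient is approximated as a sparse mix of shared atoms.
  dl = DictionaryLearning(n_components=8, alpha=1.0, random_state=0)
  codes = dl.fit_transform(G)                  # sparse codes, shape (64, 8)
  print(codes.shape, dl.components_.shape)     # atoms have shape (8, 17)
  ```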
- 🇺🇸 SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing [USA]
  arXiv:2603.15226v1 (Announce Type: new). Abstract: Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from ...
  Related: #Lifelong Learning
- 🇺🇸 Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis [USA]
  arXiv:2510.03366v2 (Announce Type: replace-cross). Abstract: Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whet...
  Related: #Transformer Architecture
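  Whatever the paper's specific protocol, layer-wise attention and activation analysis generally starts by pulling per-layer attention maps and hidden states out of the model. A minimal sketch with a Hugging Face causal LM (the model, prompt, and summary statistics are illustrative choices, not taken from the paper):

  ```python
  # Generic layer-wise inspection: per-layer attention maps and hidden states.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  inputs = tok("Paris is the capital of", return_tensors="pt")
  with torch.no_grad():
      out = model(**inputs, output_attentions=True, output_hidden_states=True)

  # out.attentions: one (batch, heads, seq, seq) map per layer;
  # out.hidden_states: embeddings plus one (batch, seq, hidden) state per layer.
  for layer, (attn, hidden) in enumerate(zip(out.attentions, out.hidden_states[1:])):
      entropy = -(attn * attn.clamp_min(1e-9).log()).sum(-1).mean().item()
      act_norm = hidden.norm(dim=-1).mean().item()
      print(f"layer {layer:2d}  attention entropy {entropy:.3f}  activation norm {act_norm:.3f}")
  ```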
- 🇺🇸 Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT [USA]
  arXiv:2603.11142v1 (Announce Type: cross). Abstract: The paper explores how video models trained for classification tasks represent nuanced, hidden semantic information that may not affect the final out...
  Related: #Video Analysis
- 🇺🇸 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models [USA]
  arXiv:2603.10098v1 (Announce Type: cross). Abstract: Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate...
  Related: #Multi-Agent Systems
- 🇺🇸 Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations [USA]
  arXiv:2603.09988v1 (Announce Type: cross). Abstract: Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable e...
  Related: #LLM Transparency
- 🇺🇸 Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models [USA]
  arXiv:2603.10071v1 (Announce Type: cross). Abstract: Time series foundation models (TSFMs) are increasingly deployed in high-stakes domains, yet their internal representations remain opaque. We present ...
  Related: #Time Series Analysis
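  Sparse autoencoders of the kind the abstract mentions are typically small encoder/decoder pairs trained to reconstruct cached model activations under a sparsity penalty, so that individual learned features become human-inspectable. The generic recipe, with made-up dimensions (this is not the Chronos/TSFM setup from the paper):

  ```python
  # Generic sparse-autoencoder recipe over cached activations. Illustrative only.
  import torch
  from torch import nn

  class SparseAutoencoder(nn.Module):
      def __init__(self, d_model: int, d_features: int):
          super().__init__()
          self.encoder = nn.Linear(d_model, d_features)
          self.decoder = nn.Linear(d_features, d_model)

      def forward(self, x):
          feats = torch.relu(self.encoder(x))     # non-negative feature activations
          return self.decoder(feats), feats

  acts = torch.randn(4096, 256)                   # stand-in for cached activations
  sae = SparseAutoencoder(d_model=256, d_features=2048)
  opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

  for step in range(200):
      recon, feats = sae(acts)
      loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # MSE + L1 sparsity
      opt.zero_grad()
      loss.backward()
      opt.step()
  print("final loss:", loss.item())
  ```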
- 🇺🇸 FAME: Formal Abstract Minimal Explanation for Neural Networks [USA]
  arXiv:2603.10661v1 (Announce Type: new). Abstract: We propose FAME (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first me...
  Related: #Neural Networks
- 🇺🇸 SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability [USA]
  arXiv:2507.06265v2 (Announce Type: replace-cross). Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each mo...
  Related: #Cross-Modal Alignment
- 🇺🇸 Transformers converge to invariant algorithmic cores [USA]
  arXiv:2602.22600v1 (Announce Type: cross). Abstract: Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obsta...
  Related: #Machine Learning, #Neural Networks
Key Entities (8)
- Neural network (2 articles)
- Large language model (2 articles)
- Machine learning (1 article)
- Transformers (1 article)
- FAME (1 article)
- Neutron activation analysis (1 article)
- Transformer (deep learning) (1 article)
- SPARC (1 article)
About the topic: AI Interpretability
The topic "AI Interpretability" aggregates 15+ news articles from various countries.