# AI Interpretability
Latest news articles tagged with "AI Interpretability". Follow the timeline of events, related topics, and entities.
Articles (15)
- 🇺🇸 ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations [USA]
  arXiv:2604.07019v1 (Announce Type: cross). Abstract: Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision-making processes. De...
  Related: #Machine Learning, #Research Tool
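  The snippet is cut off, but "concept saliency and selectivity" in a title like this is usually operationalized with probes over hidden activations. A minimal, hypothetical sketch of such a selectivity score (the function, toy data, and scikit-learn setup are illustrative assumptions, not the paper's method):

  ```python
  # Hypothetical sketch: score how selectively a concept is encoded in one
  # layer's activations via a cross-validated linear probe. Illustrative only.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def concept_selectivity(activations: np.ndarray, concept_labels: np.ndarray) -> float:
      """Probe accuracy above chance, as a crude selectivity score."""
      probe = LogisticRegression(max_iter=1000)
      acc = cross_val_score(probe, activations, concept_labels, cv=5).mean()
      chance = max(concept_labels.mean(), 1 - concept_labels.mean())
      return float(acc - chance)

  # Toy data just to show the call signature: a "concept" carried by unit 0.
  rng = np.random.default_rng(0)
  acts = rng.normal(size=(200, 64))          # (n_samples, hidden_dim)
  labels = (acts[:, 0] > 0).astype(int)      # binary concept labels
  print(concept_selectivity(acts, labels))   # well above 0 for this toy case
  ```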
- 🇺🇸 WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior [USA]
  arXiv:2603.18474v1 (Announce Type: cross). Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training ...
  Related: #Neural Networks
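  WASD itself is not described in the snippet; the generic ingredient this line of work rests on is intervening on a candidate neuron set and measuring the effect on behavior. A rough sketch of that ingredient (the model name, layer index, and neuron IDs below are placeholder assumptions, not findings from the paper):

  ```python
  # Illustrative only: zero-ablate a candidate set of MLP neurons with a forward
  # hook and compare the next-token distribution before and after.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")            # any causal LM works
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  layer_idx, neuron_ids = 5, [10, 42, 300]               # hypothetical "critical" neurons

  def ablate(module, inputs, output):
      output[..., neuron_ids] = 0.0                       # zero the chosen hidden units
      return output

  prompt = tok("The capital of France is", return_tensors="pt")
  with torch.no_grad():
      base = model(**prompt).logits[0, -1].softmax(-1)
      handle = model.transformer.h[layer_idx].mlp.register_forward_hook(ablate)
      ablated = model(**prompt).logits[0, -1].softmax(-1)
      handle.remove()

  kl = torch.sum(base * (base.log() - ablated.log())).item()
  print(f"KL(base || ablated) = {kl:.4f}")                # large shift = neurons matter
  ```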
- 🇺🇸 Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models [USA]
  arXiv:2603.18523v1 (Announce Type: cross). Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual ...
  Related: #Visual Reasoning
- 🇺🇸 Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations [USA]
  arXiv:2603.18353v1 (Announce Type: new). Abstract: Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpret...
  Related: #Model Errors
- 🇺🇸 DreamReader: An Interpretability Toolkit for Text-to-Image Models [USA]
  arXiv:2603.13299v1 (Announce Type: cross). Abstract: Despite the rapid adoption of text-to-image (T2I) diffusion models, causal and representation-level analysis remains fragmented and largely limited t...
  Related: #Text-to-Image
- 🇺🇸 Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients [USA]
  arXiv:2603.14665v1 (Announce Type: new). Abstract: Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is f...
  Related: #Model Behavior
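  One way to picture "sparse decomposition of training gradients" is factoring per-example gradient vectors into a small dictionary of shared atoms, rather than attributing behavior to whole documents. The toy sketch below illustrates that reading on a tiny linear model; it is not the paper's algorithm, and the dimensions are arbitrary.

  ```python
  # Toy illustration: collect per-example gradients and factor them into a
  # small dictionary of sparse "atoms" with scikit-learn. Illustrative only.
  import numpy as np
  import torch
  from sklearn.decomposition import DictionaryLearning

  torch.manual_seed(0)
  model = torch.nn.Linear(16, 1)
  loss_fn = torch.nn.MSELoss()

  X, y = torch.randn(64, 16), torch.randn(64, 1)
  grads = []
  for xi, yi in zip(X, y):                     # one gradient vector per example
      model.zero_grad()
      loss_fn(model(xi), yi).backward()
      grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]).numpy())
  G = np.stack(grads)                          # (n_examples, n_params) = (64, 17)

  # Each example's gradient is approximated as a sparse mix of shared atoms.
  dl = DictionaryLearning(n_components=8, alpha=1.0, random_state=0)
  codes = dl.fit_transform(G)                  # sparse codes, shape (64, 8)
  print(codes.shape, dl.components_.shape)     # atoms have shape (8, 17)
  ```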
- 🇺🇸 SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing [USA]
  arXiv:2603.15226v1 (Announce Type: new). Abstract: Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from ...
  Related: #Lifelong Learning
- 🇺🇸 Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis [USA]
  arXiv:2510.03366v2 (Announce Type: replace-cross). Abstract: Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whet...
  Related: #Transformer Architecture
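  Whatever the paper's specific protocol, layer-wise attention and activation analysis generally starts by pulling per-layer attention maps and hidden states out of the model. A minimal sketch with a Hugging Face causal LM (the model, prompt, and summary statistics are illustrative choices, not taken from the paper):

  ```python
  # Generic layer-wise inspection: per-layer attention maps and hidden states.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  inputs = tok("Paris is the capital of", return_tensors="pt")
  with torch.no_grad():
      out = model(**inputs, output_attentions=True, output_hidden_states=True)

  # out.attentions: one (batch, heads, seq, seq) map per layer;
  # out.hidden_states: embeddings plus one (batch, seq, hidden) state per layer.
  for layer, (attn, hidden) in enumerate(zip(out.attentions, out.hidden_states[1:])):
      entropy = -(attn * attn.clamp_min(1e-9).log()).sum(-1).mean().item()
      act_norm = hidden.norm(dim=-1).mean().item()
      print(f"layer {layer:2d}  attention entropy {entropy:.3f}  activation norm {act_norm:.3f}")
  ```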
- 🇺🇸 Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT [USA]
  arXiv:2603.11142v1 (Announce Type: cross). Abstract: The paper explores how video models trained for classification tasks represent nuanced, hidden semantic information that may not affect the final out...
  Related: #Video Analysis
- 🇺🇸 Code-Space Response Oracles: Generating Interpretable Multi-Agent Policies with Large Language Models [USA]
  arXiv:2603.10098v1 (Announce Type: cross). Abstract: Recent advances in multi-agent reinforcement learning, particularly Policy-Space Response Oracles (PSRO), have enabled the computation of approximate...
  Related: #Multi-Agent Systems
- 🇺🇸 Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations [USA]
  arXiv:2603.09988v1 (Announce Type: cross). Abstract: Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable e...
  Related: #LLM Transparency
- 🇺🇸 Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models [USA]
  arXiv:2603.10071v1 (Announce Type: cross). Abstract: Time series foundation models (TSFMs) are increasingly deployed in high-stakes domains, yet their internal representations remain opaque. We present ...
  Related: #Time Series Analysis
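  Sparse autoencoders of the kind the abstract mentions are typically small encoder/decoder pairs trained to reconstruct cached model activations under a sparsity penalty, so that individual learned features become human-inspectable. The generic recipe, with made-up dimensions (this is not the Chronos/TSFM setup from the paper):

  ```python
  # Generic sparse-autoencoder recipe over cached activations. Illustrative only.
  import torch
  from torch import nn

  class SparseAutoencoder(nn.Module):
      def __init__(self, d_model: int, d_features: int):
          super().__init__()
          self.encoder = nn.Linear(d_model, d_features)
          self.decoder = nn.Linear(d_features, d_model)

      def forward(self, x):
          feats = torch.relu(self.encoder(x))     # non-negative feature activations
          return self.decoder(feats), feats

  acts = torch.randn(4096, 256)                   # stand-in for cached activations
  sae = SparseAutoencoder(d_model=256, d_features=2048)
  opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

  for step in range(200):
      recon, feats = sae(acts)
      loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # MSE + L1 sparsity
      opt.zero_grad()
      loss.backward()
      opt.step()
  print("final loss:", loss.item())
  ```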
- 🇺🇸 FAME: Formal Abstract Minimal Explanation for Neural Networks [USA]
  arXiv:2603.10661v1 (Announce Type: new). Abstract: We propose FAME (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first me...
  Related: #Neural Networks
- 🇺🇸 SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability [USA]
  arXiv:2507.06265v2 (Announce Type: replace-cross). Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each mo...
  Related: #Cross-Modal Alignment
- 🇺🇸 Transformers converge to invariant algorithmic cores [USA]
  arXiv:2602.22600v1 (Announce Type: cross). Abstract: Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obsta...
  Related: #Machine Learning, #Neural Networks
Key Entities (8)
- Neural network (2 articles)
- Large language model (2 articles)
- Machine learning (1 article)
- Transformers (1 article)
- FAME (1 article)
- Neutron activation analysis (1 article)
- Transformer (deep learning) (1 article)
- SPARC (1 article)
About the topic: AI Interpretability
The topic "AI Interpretability" aggregates 15+ news articles from various countries.