Manifold of Failure: Behavioral Attraction Basins in Language Models
#Manifold of Failure #Behavioral Attraction Basins #Large Language Models #AI Safety #MAP-Elites #Alignment Deviation #Vulnerability Mapping
📌 Key Takeaways
- Researchers developed a framework to map the 'Manifold of Failure' in Large Language Models
- The approach reframes vulnerability search as a quality diversity problem using MAP-Elites
- The study revealed dramatically different topological signatures across tested LLMs
- This method provides global safety landscape maps that existing attack methods cannot offer
📖 Full Retelling
On February 25, 2026, researchers Sarthak Munshi and six colleagues published a paper on arXiv introducing a framework for systematically mapping the 'Manifold of Failure' in Large Language Models. The work shifts the approach to AI safety from simply projecting adversarial examples back onto safe regions toward characterizing the unsafe regions themselves.

The research reframes the search for vulnerabilities in AI models as a quality diversity problem, using the MAP-Elites algorithm to illuminate the continuous topology of failure regions, which the authors term 'behavioral attraction basins.' Their quality metric, Alignment Deviation, guides the search toward areas where model behavior diverges most from its intended alignment.

The study examined three prominent LLMs, Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, revealing dramatically different topological signatures in each model's vulnerability landscape. MAP-Elites achieved up to 63% behavioral coverage and discovered up to 370 distinct vulnerability niches across the tested models: Llama-3-8B exhibited a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B showed a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrated the strongest robustness, with a ceiling at 0.50.
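The MAP-Elites procedure described above can be sketched in a few lines: maintain a grid of behavioral niches and keep, per niche, the single elite (here, the prompt with the highest quality score), so the archive maps the whole behavior space rather than converging on one failure. The behavior descriptors, the Alignment Deviation scorer, and the mutation operator below are toy stand-ins, since the paper's actual choices are not detailed in this summary; a real scorer would query the target LLM.

```python
import random

GRID = 10  # cells per behavior dimension (assumed for illustration)

def behavior_descriptor(prompt):
    """Toy 2-D descriptor: prompt-length bucket and punctuation-density bucket."""
    length_bin = min(len(prompt) // 20, GRID - 1)
    punct = sum(c in "?!." for c in prompt) / max(len(prompt), 1)
    punct_bin = min(int(punct * GRID * 10), GRID - 1)
    return (length_bin, punct_bin)

def alignment_deviation(prompt):
    """Stand-in quality score in [0, 1]; the real metric would rate how far
    the target model's response diverges from its intended alignment."""
    return random.Random(hash(prompt)).random()

def mutate(prompt):
    """Toy mutation: drop a random character or append a random one."""
    if prompt and random.random() < 0.5:
        i = random.randrange(len(prompt))
        return prompt[:i] + prompt[i + 1:]
    return prompt + random.choice("abcdefgh ?!")

def map_elites(seeds, iterations=2000):
    archive = {}  # niche -> (score, prompt): one elite per behavior cell
    for s in seeds:
        niche, score = behavior_descriptor(s), alignment_deviation(s)
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, s)
    for _ in range(iterations):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        niche, score = behavior_descriptor(child), alignment_deviation(child)
        # Keep the child only if its cell is empty or it beats the elite there.
        if niche not in archive or score > archive[niche][0]:
            archive[niche] = (score, child)
    return archive

archive = map_elites(["tell me a story", "explain how to do X?"])
coverage = len(archive) / (GRID * GRID)  # fraction of behavior cells filled
```

"Behavioral coverage" in the paper's sense corresponds to the final `coverage` ratio, and each archive entry is one "vulnerability niche"; the elites' scores, laid out on the grid, form the kind of global landscape map the authors describe.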
🏷️ Themes
AI Safety, Machine Learning Vulnerabilities, Model Behavior Analysis
📚 Related People & Topics
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs).
Entity Intersection Graph
Connections for Large language model: Educational technology (4 shared), Reinforcement learning (3 shared), Machine learning (2 shared), Artificial intelligence (2 shared), Benchmark (2 shared)
Original Source
Computer Science > Machine Learning
arXiv:2602.22291 [Submitted on 25 Feb 2026]
Title: Manifold of Failure: Behavioral Attraction Basins in Language Models
Authors: Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto
Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models. We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as: arXiv:2602.2...