Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models
#Vision Language Models #AI Safety #Activation Steering #Neural Networks #Censorship #Machine Learning #arXiv
📌 Key Takeaways
- Researchers have introduced a framework that uses activation steering to make refusal behavior in Vision Language Models configurable.
- Current Vision Language Models suffer from inflexible safety filters that cause over-refusal or under-refusal.
- The new framework allows for configurable safety settings based on specific user contexts and needs.
- This method avoids costly full-model retraining by steering the model's internal activations at inference time instead of modifying its weights (a minimal sketch of the idea follows this list).
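The core mechanism, as far as the digest describes it, is adding a "refusal direction" to a layer's hidden states while the model generates. Below is a minimal PyTorch-style sketch of that idea, assuming a standard transformer backbone; `model`, `layer_idx`, `refusal_direction`, and `alpha` are illustrative placeholders, not the paper's actual API.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * steering_vector to a layer's
    output activations at inference time, leaving model weights untouched."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(
            device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on one decoder layer of a VLM's language backbone:
# alpha > 0 pushes generations toward refusal, alpha < 0 relaxes it,
# so a single scalar makes refusal behavior configurable per request.
# handle = model.language_model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(refusal_direction, alpha=1.0))
# ...generate...
# handle.remove()  # detach the hook to restore default behavior
```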
🏷️ Themes
Artificial Intelligence, Cybersecurity, Technology
📚 Related People & Topics
Neural network
Structure in biology and artificial intelligence
A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or mathematical models. While individual neurons are simple, many of them together in a network can perform complex tasks.
Censorship
Suppression of speech and information
Censorship is the suppression of speech, public communication, or other information. This may be done on the basis that such material is considered objectionable, harmful, sensitive, or "inconvenient". Censorship can be conducted by governments and private institutions.
🔗 Entity Intersection Graph
Connections for Neural network:
- 🌐 Deep learning (4 shared articles)
- 🌐 Reinforcement learning (2 shared articles)
- 🌐 Machine learning (2 shared articles)
- 🌐 Large language model (2 shared articles)
- 🌐 CSI (1 shared article)
- 🌐 Mechanistic interpretability (1 shared article)
- 🌐 Batch normalization (1 shared article)
- 🌐 PPO (1 shared article)
- 🌐 Global workspace theory (1 shared article)
- 🌐 Cognitive neuroscience (1 shared article)
- 🌐 Robustness (1 shared article)
- 🌐 Homeostasis (1 shared article)
📄 Original Source Content
arXiv:2602.07013v1 (Announce Type: cross)
Abstract: With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we firstly explore the challenges mentioned above and develop Configur…
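The abstract is truncated here, so the framework's exact recipe is not shown. For context, a common way to obtain a steering vector in the activation-steering literature (not necessarily this paper's method) is a difference of mean activations over contrastive prompt sets, sketched below; the tensor names are placeholders.

```python
import torch

def refusal_direction(acts_refuse: torch.Tensor,
                      acts_comply: torch.Tensor) -> torch.Tensor:
    """Difference-in-means steering vector.

    acts_refuse / acts_comply: (num_prompts, hidden_dim) activations
    collected at one layer for prompts the model refuses vs. answers.
    """
    direction = acts_refuse.mean(dim=0) - acts_comply.mean(dim=0)
    return direction / direction.norm()  # unit vector; scale with alpha later
```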