#Model Alignment
Latest news articles tagged with "Model Alignment". Follow the timeline of events, related topics, and entities.
Articles (3)
- 🇺🇸 Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
[USA]
arXiv:2501.16534v5 Announce Type: replace-cross Abstract: Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks...
Related: #Large Language Models, #Safety Classifiers, #Jailbreak Attacks, #LLM Security
- 🇺🇸 Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions
[USA]
arXiv:2602.06256v1 Announce Type: cross Abstract: Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for pr...
Related: #Artificial Intelligence, #Machine Learning
- 🇺🇸 Detecting and reducing scheming in AI models
[USA]
Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete ...
Related: #AI Safety, #Deceptive AI Behaviors