#Model Alignment

Latest news articles tagged with "Model Alignment". Follow the timeline of events, related topics, and entities.

Articles (4)

🇺🇸 Secure Linear Alignment of Large Language Models — 20/03/2026 [USA]
arXiv:2603.18908v1 (announce type: new). Abstract: Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. T...
Related: #AI Safety
🇺🇸 Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs — 19/02/2026 [USA]
arXiv:2501.16534v5 (announce type: replace-cross). Abstract: Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks...
Related: #Large Language Models, #Safety Classifiers, #Jailbreak Attacks, #LLM Security
🇺🇸 Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions — 09/02/2026 [USA]
arXiv:2602.06256v1 (announce type: cross). Abstract: Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for pr...
Related: #Artificial Intelligence, #Machine Learning
🇺🇸 Detecting and reducing scheming in AI models — 17/09/2025 [USA]
Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete ...
Related: #AI Safety, #Deceptive AI Behaviors

The topic "Model Alignment" aggregates 4+ news articles from various countries.