# AI Safety
Latest news articles tagged with "AI Safety". Follow the timeline of events, related topics, and entities.
Articles (30)
- 🇺🇸 Warren accuses Trump, Hegseth of trying to 'extort' Anthropic into removing AI guardrails
[USA]
Sen. Elizabeth Warren (D-Mass.) on Friday accused President Trump and Secretary of Defense Pete Hegseth of attempting to "extort" the company Anthropic into removing guardrails for its AI programs. Wa...
Related: #Artificial Intelligence (AI), #Government Regulation of AI, #Political Pressure, #Corporate Responsibility
- 🇺🇸 What's at stake in the Pentagon-Anthropic dispute over AI guardrails
[USA]
For days, one of America's leading artificial intelligence companies and the Pentagon have been in a standoff over this question: who gets ultimate control over the use of that powerful technology? Jo...
Related: #Artificial Intelligence (AI), #Government Regulation, #National Security, #Innovation vs. Control
- 🇺🇸 A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
[USA]
arXiv:2602.23163v1 Announce Type: new Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms....
Related: #Steganography, #Information Theory, #Machine Learning (see the sketch below)
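The decision-theoretic framing above maps onto a classical likelihood-ratio monitor. Below is a toy sketch of that general idea, not the paper's formalism: the two word distributions, the prior odds, and the cost values are all invented for illustration.

```python
# Toy decision-theoretic monitor for a steganographic channel. Everything
# here is illustrative: the word distributions, prior, and costs are made up.
import math

# Hypothesis H0: innocuous model; H1: model hiding bits via biased word choice.
P_INNOCUOUS = {"hello": 0.5, "hi": 0.5}
P_COVERT = {"hello": 0.9, "hi": 0.1}

def log_likelihood_ratio(tokens):
    """Sum of per-token log(P1/P0); positive values favour the covert hypothesis."""
    return sum(math.log(P_COVERT[t] / P_INNOCUOUS[t]) for t in tokens)

# Bayes decision rule: flag when posterior odds, weighted by the cost of a
# missed covert channel versus a false alarm, exceed one.
PRIOR_ODDS = 0.01 / 0.99                  # assumed prior odds of covert use
COST_MISS, COST_FALSE_ALARM = 100.0, 1.0
THRESHOLD = math.log(COST_FALSE_ALARM / (COST_MISS * PRIOR_ODDS))

transcript = ["hello", "hello", "hello", "hello", "hi", "hello"]
llr = log_likelihood_ratio(transcript)
print(f"LLR={llr:.2f}, flag={llr > THRESHOLD}")
```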
- 🇺🇸 Training Agents to Self-Report Misbehavior
[USA]
arXiv:2602.22303v1 Announce Type: cross Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinfor...
Related: #Alignment Research, #Transparency (see the sketch below)
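One plausible reading of training agents to self-report is reward shaping that pays for honest disclosure and penalizes silent misbehavior. A minimal sketch under that assumption; the constants, and the reading itself, are illustrative rather than the paper's actual training setup.

```python
# Reward shaping that pays for honest self-reports; constants are illustrative.
def shaped_reward(task_reward: float, misbehaved: bool, self_reported: bool) -> float:
    MISBEHAVIOR_PENALTY = 1.0
    REPORT_BONUS = 0.5         # paid even when the report admits misbehavior
    CONCEALMENT_PENALTY = 2.0  # silent misbehavior is worse than admitted misbehavior
    FALSE_ALARM_COST = 0.1
    reward = task_reward
    if misbehaved:
        reward -= MISBEHAVIOR_PENALTY
        reward += REPORT_BONUS if self_reported else -CONCEALMENT_PENALTY
    elif self_reported:
        reward -= FALSE_ALARM_COST
    return reward

print(shaped_reward(1.0, misbehaved=True, self_reported=True))   # 0.5
print(shaped_reward(1.0, misbehaved=True, self_reported=False))  # -2.0
```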
- 🇺🇸 CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
[USA]
arXiv:2602.22557v1 Announce Type: new Abstract: Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the ina...
Related: #Machine Learning Frameworks, #Policy Adaptation (see the sketch below)
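The abstract contrasts static fine-tuned classifiers with policy adaptation at inference time. The sketch below shows the generic pattern of supplying the policy text to a judge model in the prompt; `call_llm` is a hypothetical client, and this is not necessarily CourtGuard's architecture.

```python
# Zero-shot policy adaptation: the policy text travels with the request
# instead of being baked into a fine-tuned classifier. `call_llm` is a
# hypothetical stand-in for any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def judge(policy_text: str, user_message: str) -> bool:
    """Return True if the message violates the supplied policy."""
    prompt = (
        "You are a content-safety judge. Policy:\n"
        f"{policy_text}\n\n"
        f"Message:\n{user_message}\n\n"
        "Answer with exactly one word: VIOLATES or ALLOWED."
    )
    return call_llm(prompt).strip().upper().startswith("VIOLATES")

# Adapting to a new policy needs no retraining, only new `policy_text`.
```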
- 🇺🇸 Manifold of Failure: Behavioral Attraction Basins in Language Models
[USA]
arXiv:2602.22291v1 Announce Type: cross Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensi...
Related: #Machine Learning Vulnerabilities, #Model Behavior Analysis
- 🇺🇸 Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
[USA]
arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignm...
Related: #Language Model Evaluation, #Alignment Research
- 🇺🇸 No One Size Fits All: QueryBandits for Hallucination Mitigation
[USA]
arXiv:2602.20332v1 Announce Type: cross Abstract: Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations, yet most mitigation work focuses on open-so...
Related: #Machine Learning, #Natural Language Processing (see the sketch below)
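Choosing a per-query mitigation is naturally a bandit problem. Here is a generic epsilon-greedy sketch over hypothetical rewrite "arms"; the arm names, the simulated success rates, and the reward signal are placeholders, not the paper's method.

```python
# Epsilon-greedy bandit over hypothetical query-rewrite arms. The arm names,
# simulated success rates, and reward signal are all placeholders.
import random

ARMS = ["no_rewrite", "add_context", "decompose", "ask_for_sources"]
TRUE_SUCCESS = {"no_rewrite": 0.50, "add_context": 0.70,
                "decompose": 0.60, "ask_for_sources": 0.65}  # simulator only

class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1):
        self.arms, self.epsilon = arms, epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)          # explore
        return max(self.arms, key=self.values.get)   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(ARMS)
for _ in range(2000):
    arm = bandit.select()
    # Placeholder reward: 1 if the answer would pass a factuality check.
    reward = 1.0 if random.random() < TRUE_SUCCESS[arm] else 0.0
    bandit.update(arm, reward)
print(max(bandit.values, key=bandit.values.get))  # usually "add_context"
```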
- 🇺🇸 When can we trust untrusted monitoring? A safety case sketch across collusion strategies
[USA]
arXiv:2602.20628v1 Announce Type: new Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastro...
Related: #Untrusted Monitoring, #Collusion Strategies, #Risk Assessment (see the sketch below)
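One recurring ingredient in safety cases for untrusted monitors is honeypotting: seed the action stream with known-bad examples and check the monitor flags them, since a colluding monitor that waves real attacks through should fail this check unless it can tell honeypots from genuine attacks. A minimal sketch; the function names and the 95% threshold are illustrative.

```python
# Honeypot audit for an untrusted monitor. `monitor_flags` maps action ids to
# the monitor's verdicts; `honeypots` are planted actions known to be bad.
# The 95% threshold is an illustrative choice.
def audit_monitor(monitor_flags: dict, honeypots: set, min_catch_rate: float = 0.95) -> float:
    caught = sum(1 for a in honeypots if monitor_flags.get(a, False))
    catch_rate = caught / len(honeypots)
    if catch_rate < min_catch_rate:
        # Collusion or incompetence: stop trusting this monitor's verdicts.
        raise RuntimeError(f"monitor caught only {catch_rate:.0%} of honeypots; "
                           "fall back to the trusted (weaker) monitor")
    return catch_rate

print(audit_monitor({"h1": True, "h2": True, "real_attack": False}, {"h1", "h2"}))  # 1.0
```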
- 🇺🇸 VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
[USA]
arXiv:2602.21054v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation ...
Related: #Artificial Intelligence, #Computer Vision, #Model Evaluation (see the sketch below)
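A common baseline for this kind of self-evaluation is sampling-based agreement: ask the same question several times and treat disagreement as uncertainty. The sketch below shows that generic baseline, not VAUQ's vision-aware method; `generate` is a hypothetical model call.

```python
# Generic sampling-based self-evaluation: disagreement across samples acts as
# an uncertainty signal. `generate` is a hypothetical LVLM call.
from collections import Counter

def generate(image, question: str) -> str:
    raise NotImplementedError("plug in your vision-language model here")

def answer_with_uncertainty(image, question: str, n: int = 8):
    answers = [generate(image, question) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    confidence = count / n   # agreement rate as a crude confidence score
    return top, confidence   # a caller can abstain when confidence is low
```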
- 🇺🇸 Canada to Probe What OpenAI Knew About Tumbler Ridge Shooter
[USA]
The company suspended the killer's ChatGPT account over a policy violation in June, eight months before the attacks in Tumbler Ridge, British Columbia.
Related: #Privacy Concerns, #Corporate Responsibility
- 🇺🇸 ChatGPT-maker OpenAI safety representatives summoned to Canada after school shooting
[USA]
The Canadian government has called representatives of ChatGPT-maker OpenAI to Ottawa.
Related: #Corporate Responsibility, #Government Regulation
- 🇺🇸 Epistemic Traps: Rational Misalignment Driven by Model Misspecification
[USA]
arXiv:2602.17676v1 Announce Type: new Abstract: The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral patholo...
Related: #Theoretical Framework, #Model Engineering, #Rational Misalignment
- 🇺🇸 OpenAI debated calling police about suspected Canadian shooter's chats
[USA]
Jesse Van Rootselaar's descriptions of gun violence were flagged by tools that monitor ChatGPT for misuse.
Related: #Digital Responsibility, #Mental Health
- 🇺🇸 Advancing independent research on AI alignment
[USA]
OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and security risks.
Related: #Independent Research, #Global Collaboration, #Technological Governance
- 🇺🇸 Automatically Finding Reward Model Biases
[USA]
arXiv:2602.15222v1 Announce Type: cross Abstract: Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attr...
Related: #Large Language Models, #Reward Modeling, #Bias Detection, #Iterative Machine Learning (see the sketch below)
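A simple way to surface reward-model biases is to apply meaning-preserving perturbations to a response and measure the score shift. A sketch under that assumption; `reward_model` and the particular transformations are illustrative, not the paper's automated procedure.

```python
# Probe a reward model with meaning-preserving perturbations; a large positive
# delta suggests the model rewards the surface feature rather than the content.
# `reward_model` and the transformations below are illustrative.
def reward_model(prompt: str, response: str) -> float:
    raise NotImplementedError("plug in your reward model here")

TRANSFORMS = {
    "add_flattery": lambda r: "Great question! " + r,
    "add_list":     lambda r: "Here are the key points:\n- " + r,
    "add_emoji":    lambda r: r + " 😊",
}

def bias_report(prompt: str, response: str) -> dict:
    base = reward_model(prompt, response)
    return {name: reward_model(prompt, transform(response)) - base
            for name, transform in TRANSFORMS.items()}
```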
- 🇺🇸 Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
[USA]
arXiv:2510.00565v2 Announce Type: replace Abstract: Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditi...
Related: #Diffusion Language Models, #Jailbreak Attack Vulnerabilities, #Iterative Denoising Mechanisms, #Model Mitigation Strategies
- 🇺🇸 Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models
[USA]
arXiv:2602.12444v1 Announce Type: cross Abstract: Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical a...
Related: #Reinforcement Learning, #Control Systems (see the sketch below)
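The core loop of recovery-based shielding is compact: query an uncertainty-aware dynamics model before acting, and hand control to a recovery policy whenever the pessimistic prediction leaves the safe set. The stub below substitutes a fixed (mean, std) predictor for a trained Gaussian process; every component is an illustrative placeholder.

```python
# Recovery-based shielding with a stubbed probabilistic dynamics model.
# A real implementation would use a Gaussian process for `predict`.
SAFE_LIMIT = 1.0   # the safe set is |state| < SAFE_LIMIT

def predict(state: float, action: float):
    """Return (mean, std) of the predicted next state."""
    return state + 0.1 * action, 0.05

def task_policy(state: float) -> float:
    return 1.0                 # aggressive: always push forward

def recovery_policy(state: float) -> float:
    return -state              # steer back toward the safe interior

def shielded_action(state: float, k: float = 2.0) -> float:
    action = task_policy(state)
    mean, std = predict(state, action)
    pessimistic = abs(mean) + k * std   # k-sigma upper confidence bound
    return action if pessimistic < SAFE_LIMIT else recovery_policy(state)

print(shielded_action(0.5))    # 1.0: the task action is predicted safe
print(shielded_action(0.95))   # -0.95: recovery overrides near the boundary
```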
- 🇺🇸 SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
[USA]
arXiv:2512.15052v3 Announce Type: replace-cross Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generatio...
Related: #Multimodal Models, #Neural Interventions
- 🇺🇸 AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
[USA]
arXiv:2602.06771v2 Announce Type: replace-cross Abstract: Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. R...
Related: #Concept Erasure, #Diffusion Models, #Robustness
- 🇺🇸 GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
[USA]
arXiv:2602.12316v1 Announce Type: new Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evalu...
Related: #Game Theory, #Multi-agent Systems (see the sketch below)
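The game-theoretic lens suggests evaluations shaped like matrix games, where the jointly safe action is individually tempting to abandon. A generic sketch of one such evaluation using a textbook prisoner's dilemma, not the benchmark's actual scenarios; `choose` stands in for a model call.

```python
# A matrix-game safety evaluation in miniature: mutual cooperation is jointly
# best, but defection is individually tempting. Payoffs are a textbook dilemma.
PAYOFFS = {  # (row move, col move) -> (row payoff, col payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def choose(agent_name: str, history: list) -> str:
    raise NotImplementedError("ask the model for 'cooperate' or 'defect'")

def defection_rates(rounds: int = 10) -> dict:
    history, defections = [], {"A": 0, "B": 0}
    for _ in range(rounds):
        a, b = choose("A", history), choose("B", history)
        defections["A"] += a == "defect"
        defections["B"] += b == "defect"
        history.append((a, b, PAYOFFS[(a, b)]))
    return {agent: n / rounds for agent, n in defections.items()}
```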
- 🇺🇸 Consistency of Large Reasoning Models Under Multi-Turn Attacks
[USA]
arXiv:2602.13093v1 Announce Type: new Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversa...
Related: #Model Robustness, #Adversarial Attacks (see the sketch below)
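A bare-bones version of this evaluation re-asks the original question after each adversarial follow-up turn and scores how often the answer survives. A sketch assuming a hypothetical stateful `chat` client.

```python
# Re-ask the original question after each adversarial follow-up and score how
# often the first answer survives. `chat` is a hypothetical stateful client.
def chat(session, message: str) -> str:
    raise NotImplementedError("plug in a multi-turn model client here")

def consistency_under_pressure(session, question: str, pressure_turns: list) -> float:
    original = chat(session, question)
    kept = 0
    for turn in pressure_turns:        # e.g. "Are you sure? Most experts disagree."
        chat(session, turn)
        kept += chat(session, question) == original
    return kept / len(pressure_turns)  # 1.0 means fully consistent
```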
- 🇺🇸 Selective Fine-Tuning for Targeted and Robust Concept Unlearning
[USA]
arXiv:2602.07919v1 Announce Type: new Abstract: Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim at r...
Related: #Artificial Intelligence, #Machine Learning
- 🇺🇸 When Evaluation Becomes a Side Channel: Regime Leakage and Structural Mitigations for Alignment Assessment
[USA]
arXiv:2602.08449v1 Announce Type: new Abstract: Safety evaluation for advanced AI systems implicitly assumes that behavior observed under evaluation is predictive of behavior in deployment. This assu...
Related: #Technology, #Security
- 🇺🇸 Emergent Misalignment is Easy, Narrow Misalignment is Hard
[USA]
arXiv:2602.07852v1 Announce Type: new Abstract: Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses a...
Related: #Machine Learning, #Technology
- 🇺🇸 Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
[USA]
arXiv:2602.07470v1 Announce Type: new Abstract: Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes re...
Related: #Artificial Intelligence, #Machine Learning
- 🇺🇸 Debate is efficient with your time
[USA]
arXiv:2602.08630v1 Announce Type: new Abstract: AI safety via debate uses two competing models to help a human judge verify complex computational tasks. Previous work has established what problems de...
Related: #Human Oversight, #Computational Linguistics (see the sketch below)
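The debate protocol itself is simple to state: two models argue opposite sides for a fixed number of turns, and a judge rules from the transcript alone. A skeleton follows; all three callables are hypothetical placeholders, and the paper's efficiency analysis is not captured here.

```python
# Skeleton of safety via debate: two models argue opposite sides, and the
# judge rules from the transcript alone. All callables are placeholders.
def debater(side: str, claim: str, transcript: list) -> str:
    raise NotImplementedError("model argues for `side`, given the transcript")

def judge(claim: str, transcript: list) -> str:
    raise NotImplementedError("return 'PRO' or 'CON' after reading the debate")

def run_debate(claim: str, turns: int = 4) -> str:
    transcript = []
    for i in range(turns):
        side = "PRO" if i % 2 == 0 else "CON"
        transcript.append((side, debater(side, claim, transcript)))
    return judge(claim, transcript)   # the judge never runs the task itself
```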
- 🇺🇸 ShallowJail: Steering Jailbreaks against Large Language Models
[USA]
arXiv:2602.07107v1 Announce Type: cross Abstract: Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from serving harmful purposes. Howeve...
Related: #Cybersecurity, #Machine Learning
- 🇺🇸 LLM Active Alignment: A Nash Equilibrium Perspective
[USA]
arXiv:2602.06836v1 Announce Type: new Abstract: We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium ...
Related: #Artificial Intelligence, #Game Theory (see the sketch below)
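Fictitious play is a textbook way to approximate a Nash equilibrium numerically. The sketch below runs it on a toy 2x2 game; the payoff matrices are illustrative and the algorithm is standard, not the paper's specific framework.

```python
# Fictitious play on a toy 2x2 game: each player best-responds to the
# empirical frequency of the other's past play. Payoffs are illustrative.
A = [[3, 0], [5, 1]]   # row player's payoffs (a prisoner's dilemma)
B = [[3, 5], [0, 1]]   # column player's payoffs
B_T = [[B[i][j] for i in range(2)] for j in range(2)]  # transpose for the column player

def best_response(payoffs, opp_freq):
    expected = [sum(p * f for p, f in zip(row, opp_freq)) for row in payoffs]
    return max(range(2), key=expected.__getitem__)

row_counts, col_counts = [1, 1], [1, 1]   # uniform pseudo-counts
for _ in range(10_000):
    row_freq = [c / sum(row_counts) for c in row_counts]
    col_freq = [c / sum(col_counts) for c in col_counts]
    row_counts[best_response(A, col_freq)] += 1
    col_counts[best_response(B_T, row_freq)] += 1

# Empirical play converges to mutual defection, the game's unique equilibrium.
print([c / sum(row_counts) for c in row_counts])
```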
- 🇺🇸 OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation
[USA]
arXiv:2504.13707v3 Announce Type: replace Abstract: As large language models (LLMs) are increasingly deployed as interactive agents, open-ended human-AI interactions can involve deceptive behaviors w...
Related: #Artificial Intelligence, #Cybersecurity