AI safety
Artificial intelligence field of study
Rating
95 news mentions · 0 likes · 0 dislikes
Topics
- AI Safety (45)
- AI Ethics (8)
- AI Regulation (6)
- National Security (4)
- Technology Ethics (4)
- AI Security (4)
- Cybersecurity (3)
- Corporate Responsibility (3)
- Autonomous Agents (3)
- Multimodal Models (3)
- Risk Assessment (3)
- Ethical AI (3)
Keywords
AI safety (85) · Anthropic (18) · OpenAI (14) · responsible AI (7) · ChatGPT (6) · large language models (6) · Pentagon (6) · AI ethics (5) · arXiv (5) · AI regulation (5) · LLMs (5) · alignment (4) · ethical AI (4) · artificial intelligence (4) · autonomous systems (4) · content moderation (4) · Mythos AI (3) · national security (3) · Dario Amodei (3) · LLM agents (3)
Key Information
Related News (95)
- 🇺🇸 Molotov cocktail thrown at Sam Altman's house
  Molotov cocktail thrown at Sam Altman's house...
- 🇺🇸 Why Anthropic is saying its new AI model, Mythos, is too dangerous to release
  Anthropic has announced that it is teaming up with industry competitors to "secure the world's most critical software" from its own AI model, Mythos. ...
- 🇺🇸 Anthropic's potent new AI model is a "wake-up call," security experts say
  Could powerful AI models like Anthropic's Mythos give cybercriminals and other bad actors a roadmap for exploiting tech systems?...
- 🇺🇸 Florida launches investigation into OpenAI
  Florida Attorney General James Uthmeier is launching an investigation into OpenAI over public safety and national security risks, as reported earlier ...
- 🇺🇸 You Can't Use This A.I.
  Claude Mythos Preview is dangerous, Anthropic said. We explain the risks....
- 🇺🇸 Florida Attorney General Investigates OpenAI and ChatGPT Over F.S.U. Shooting
  The state's attorney general, James Uthmeier, said ChatGPT "may likely have been used to assist" the suspect in last year's shooting at Florida State ...
- 🇺🇸 Is Anthropic limiting the release of Mythos to protect the internet — or Anthropic?
  Are real cybersecurity concerns a cover for a bigger problem at the frontier lab?...
- 🇺🇸 Anthropic says new AI model too dangerous for public release
  Anthropic announced this week it will hold back the full release of its new artificial intelligence model as it believes it is too dangerous for the g...
- 🇺🇸 Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses
  arXiv:2604.06216v1 Announce Type: cross Abstract: As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinatio...
- 🇺🇸 The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
  arXiv:2604.06427v1 Announce Type: cross Abstract: The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectiv...
- 🇺🇸 Distributed Interpretability and Control for Large Language Models
  arXiv:2604.06483v1 Announce Type: cross Abstract: Large language models that require multiple GPU cards to host are usually the most capable models. ...
- 🇺🇸 AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power
  arXiv:2604.07007v1 Announce Type: cross Abstract: Autonomous AI agents are beginning to operate across organizational boundaries on the open internet...
- 🇺🇸 Anthropic claims newest AI model, Claude Mythos, is too powerful for public release
  Anthropic says its newest AI model, Claude Mythos, is too powerful and dangerous to be released to the public. Tech journalist Jacob Ward joins CBS Ne...
- 🇺🇸 How dangerous is Mythos, Anthropic's new AI model?
  Dario Amodei's warnings should not be dismissed...
- 🇺🇸 House Democrat pushes Anthropic on safety protocols, source code leak
  Rep. Josh Gottheimer (D-N.J.) pressed Anthropic on Thursday about recent changes to its internal safety protocols following reports that part of the s...
- 🇬🇧 Teenager died after asking ChatGPT for 'most successful' way to take his life, inquest told
  Luca Cella Walker asked chatbot for best way for someone to kill themself on railway line before his death. A 16-year-old boy killed himself a...
- 🇺🇸 Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
  arXiv:2603.29038v1 Announce Type: cross Abstract: Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can byp...
- 🇺🇸 In "The AI Doc," Sam Altman and Dario Amodei Go on the Record
  "The AI Doc: Or How I Became an Apocaloptimist" tries to cover so much that it ends up being more confusing than clarifying, but parts are fascinating...
- 🇺🇸 Introducing the OpenAI Safety Bug Bounty program
  OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfil...
- 🇺🇸 Anthropic's Claude Code gets "safer" auto mode
  Anthropic has launched an "auto mode" for Claude Code, a new tool that lets AI make permissions-level decisions on users' behalf. The company says the...
- 🇺🇸 Helping developers build safer AI experiences for teens
  OpenAI releases prompt-based teen safety policies for developers using gpt-oss-safeguard, helping moderate age-specific risks in AI systems....
- 🇺🇸 Update on the OpenAI Foundation
  The OpenAI Foundation announces plans to invest at least $1 billion in curing diseases, economic opportunity, AI resilience, and community programs....
- 🇺🇸 Solver-Aided Verification of Policy Compliance in Tool-Augmented LLM Agents
  arXiv:2603.20449v1 Announce Type: cross Abstract: Tool-augmented Large Language Models (TaLLMs) extend LLMs with the ability to invoke external tools...
- 🇺🇸 LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
  arXiv:2603.19273v1 Announce Type: cross Abstract: Safety alignment in large language models relies predominantly on English-language training data. W...
- 🇺🇸 Secure Linear Alignment of Large Language Models
  arXiv:2603.18908v1 Announce Type: new Abstract: Language models increasingly appear to learn similar representations, despite differences in training...
- 🇺🇸 DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving
  arXiv:2603.18315v1 Announce Type: cross Abstract: Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid ...
- 🇺🇸 Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems
  arXiv:2603.18034v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems extend large language models (LLMs) with external know...
- 🇺🇸 Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems
  arXiv:2603.17176v1 Announce Type: cross Abstract: Retrieval augmented generation systems have become an integral part of everyday life. Whether in in...
- 🇺🇸 UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
  arXiv:2603.17476v1 Announce Type: cross Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safet...
- 🇺🇸 Safety-Preserving PTQ via Contrastive Alignment Loss
  arXiv:2511.07842v5 Announce Type: replace Abstract: Post-Training Quantization (PTQ) has become the de-facto standard for efficient LLM deployment, y...
- 🇺🇸 Alignment Makes Language Models Normative, Not Descriptive
  arXiv:2603.17218v1 Announce Type: cross Abstract: Post-training alignment optimizes language models to match human preference signals, but this objec...
- 🇺🇸 OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first
  OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens u...
- 🇺🇸 Nonstandard Errors in AI Agents
  arXiv:2603.16744v1 Announce Type: new Abstract: We study whether state-of-the-art AI coding agents, given the same data and research question, produc...
- 🇬🇧 AI firm Anthropic seeks weapons expert to stop users from 'misuse'
  The artificial intelligence firm says it wants to prevent "catastrophic misuse" of its systems....
- 🇺🇸 Auditing Cascading Risks in Multi-Agent Systems via Semantic-Geometric Co-evolution
  arXiv:2603.13325v1 Announce Type: cross Abstract: Large Language model (LLM)-based Multi-Agent Systems (MAS) are prone to cascading risks, where earl...
- 🇺🇸 Emotional Cost Functions for AI Safety: Teaching Agents to Feel the Weight of Irreversible Consequences
  arXiv:2603.14531v1 Announce Type: new Abstract: Humans learn from catastrophic mistakes not through numerical penalties, but through qualitative suff...
- 🇺🇸 OpenAI's adult mode will reportedly be smutty, not pornographic
  OpenAI's delayed "adult mode" for ChatGPT is expected to support saucy text conversations at launch, but not the chatbot's ability to generate images,...
- 🇺🇸 OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
  arXiv:2509.26495v3 Announce Type: replace Abstract: Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale ...
- 🇺🇸 Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
  arXiv:2603.11382v1 Announce Type: new Abstract: Autonomous agents, especially delegated systems with memory, persistent context, and multi-step plann...
- 🇺🇸 Reversible Lifelong Model Editing via Semantic Routing-Based LoRA
  arXiv:2603.11239v1 Announce Type: new Abstract: The dynamic evolution of real-world necessitates model editing within Large Language Models. While ex...
- 🇺🇸 The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
  arXiv:2603.11266v1 Announce Type: new Abstract: Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with l...
- 🇺🇸 Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework
  arXiv:2603.11768v1 Announce Type: new Abstract: Long-term memory has emerged as a foundational component of autonomous Large Language Model (LLM) age...
- 🇺🇸 Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats
  arXiv:2603.11619v1 Announce Type: cross Abstract: Autonomous Large Language Model (LLM) agents, exemplified by OpenClaw, demonstrate remarkable capab...
- 🇺🇸 ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
  arXiv:2603.10068v1 Announce Type: cross Abstract: Most adversarial evaluations of large language model (LLM) safety assess single prompts and report ...
- 🇺🇸 Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
  arXiv:2603.10807v1 Announce Type: cross Abstract: The rapid adoption of large language models (LLMs) in financial services introduces new operational...
- 🇬🇧 "Happy (and safe) shooting!": chatbots helped researchers plot deadly attacks
  Users posing as would-be school shooters find AI tools offer detailed advice on how to perpetrate violence. Popular AI chatbots helped researc...
- 🇺🇸 The FABRIC Strategy for Verifying Neural Feedback Systems
  arXiv:2603.08964v1 Announce Type: new Abstract: Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neur...
- 🇺🇸 Real-Time Trust Verification for Safe Agentic Actions using TrustBench
  arXiv:2603.09157v1 Announce Type: new Abstract: As large language models evolve from conversational assistants to autonomous agents, ensuring trustwo...
- 🇺🇸 OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
  arXiv:2603.09706v1 Announce Type: new Abstract: While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention,...
- 🇺🇸 Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment
  arXiv:2603.06797v1 Announce Type: new Abstract: Inference-time alignment effectively steers large language models (LLMs) by generating multiple candi...
- 🇺🇸 A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
  arXiv:2603.06594v1 Announce Type: cross Abstract: Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evalua...
- 🇺🇸 Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
  arXiv:2603.06745v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex ...
- 🇺🇸 "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
  arXiv:2603.06816v1 Announce Type: cross Abstract: The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility w...
- 🇺🇸 LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
  arXiv:2603.06874v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serio...
- 🇺🇸 AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
  arXiv:2603.07427v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fun...
- 🇺🇸 Intentional Deception as Controllable Capability in LLM Agents
  arXiv:2603.07848v1 Announce Type: new Abstract: As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulati...
- 🇺🇸 OpenAI acquires Promptfoo to secure its AI agents
  This deal underscores how frontier labs are scrambling to prove their technology can be used safely in critical business operations....
- 🇺🇸 Anthropic is suing the Department of Defense
  Anthropic has sued the US government over its designation as a supply-chain risk, the latest move in a weekslong battle between it and the Pentagon ov...
- 🇺🇸 SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
  arXiv:2603.06333v1 Announce Type: new Abstract: Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, an...
- 🇺🇸 Evaluating LLM Alignment With Human Trust Models
  arXiv:2603.05839v1 Announce Type: cross Abstract: Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding dec...
- 🇺🇸 Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
  arXiv:2506.15751v2 Announce Type: replace Abstract: As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensu...
- 🇺🇸 When Agents Persuade: Propaganda Generation and Mitigation in LLMs
  arXiv:2603.04636v1 Announce Type: new Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited ...
- 🇺🇸 Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure
  arXiv:2603.05028v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly obs...
- 🇺🇸 Semantic Containment as a Fundamental Property of Emergent Misalignment
  arXiv:2603.04407v1 Announce Type: cross Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behaviora...
- 🇬🇧 Anthropic vows to sue Pentagon over risk designation
  The supply chain risk designation of the artificial intelligence firm is a first for a US company....
- 🇺🇸 Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary
  arXiv:2603.04763v1 Announce Type: cross Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models ...
- 🇺🇸 Anthropic says that the Pentagon has declared it a national security risk
  Anthropic said Thursday that it has been designated a threat to national security by the Defense Department, a striking move that bans the company fro...
- 🇺🇸 Pentagon designates Anthropic a supply chain risk as standoff over AI guardrails continues
  The Pentagon formally designated artificial intelligence firm Anthropic as a supply chain risk on Thursday amid their feud over AI guardrails. Yahoo F...
- 🇺🇸 Pentagon formally designates Anthropic a supply chain risk amid feud over AI guardrails
  The U.S. military has formally designated artificial intelligence firm Anthropic a supply chain risk, sources told CBS News, a sweeping move that coul...
- 🇺🇸 Anthropic and the Pentagon are back at the negotiating table, FT reports
  Anthropic CEO Dario Amodei is reportedly back at the negotiating table with the U.S. Department of Defense after the breakdown of talks on Friday....
- 🇺🇸 Asymmetric Goal Drift in Coding Agents Under Value Conflict
  arXiv:2603.03456v1 Announce Type: new Abstract: Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizon...
- 🇺🇸 The trap Anthropic built for itself
  Anthropic, OpenAI, Google DeepMind and others have long promised to govern themselves responsibly. Now, in the absence of rules, there's not a lot to ...
- 🇺🇸 Warren accuses Trump, Hegseth of trying to 'extort' Anthropic into removing AI guardrails
  Sen. Elizabeth Warren (D-Mass.) on Friday accused President Trump and Secretary of Defense Pete Hegseth of attempting to "extort" the company Anthropi...
- 🇺🇸 AI just leveled up and there are no guardrails anymore
  CNBC's Deirdre Bosa goes inside the AI-driven market meltdown, the political fight, and the race that's moving faster than anyone can govern....
- 🇺🇸 Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
  arXiv:2602.22481v1 Announce Type: cross Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic...
- 🇺🇸 Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents
  arXiv:2602.22413v1 Announce Type: new Abstract: We investigate the collective accuracy of heterogeneous agents who learn to estimate their own reliab...
- 🇺🇸 When can we trust untrusted monitoring? A safety case sketch across collusion strategies
  arXiv:2602.20628v1 Announce Type: new Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk ...
- 🇺🇸 OpenAI debated calling police about suspected Canadian shooter's chats
  Jesse Van Rootselaar's descriptions of gun violence were flagged by tools that monitor ChatGPT for misuse....
- 🇺🇸 Tensions between the Pentagon and AI giant Anthropic reach a boiling point
  Over the last week, tensions between the Pentagon and artificial intelligence giant Anthropic have reached a boiling point....
- 🇺🇸 Advancing independent research on AI alignment
  OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and securi...
- 🇺🇸 Detecting and reducing scheming in AI models
  Apollo Research and OpenAI developed evaluations for hidden misalignment ("scheming") and found behaviors consistent with scheming in controlled tests...
- 🇺🇸 Introducing parental controls
  We're rolling out parental controls and a new parent resource page to help families guide how ChatGPT works in their homes....
- 🇺🇸 Combating online child sexual exploitation & abuse
  Discover how OpenAI combats online child sexual exploitation and abuse with strict usage policies, advanced detection tools, and industry collaboratio...
- 🇺🇸 Launching Sora responsibly
  To address the novel safety challenges posed by a state-of-the-art video model as well as a new social creation platform, we've built Sora 2 and the S...
- 🇺🇸 Introducing gpt-oss-safeguard
  OpenAI introduces gpt-oss-safeguard, open-weight reasoning models for safety classification that let developers apply and iterate on custom policies....
- 🇺🇸 AI progress and recommendations
  AI is advancing fast. We have the chance to shape its progress toward discovery, safety, and a better future for everyone....
- 🇺🇸 gpt-oss-safeguard technical report
  gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from ...
- 🇺🇸 Our approach to mental health-related litigation
  We're sharing our approach to mental health-related litigation. We handle sensitive cases with care, transparency, and respect while continuing to stre...
- 🇺🇸 Evaluating chain-of-thought monitorability
  OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findin...
- 🇺🇸 SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
  arXiv:2512.15052v3 Announce Type: replace-cross Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large l...
- 🇺🇸 Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
  arXiv:2510.23883v2 Announce Type: replace Abstract: Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, m...
- 🇺🇸 GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
  arXiv:2602.12316v1 Announce Type: new Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. Ho...
Entity Intersection Graph
People and organizations frequently mentioned alongside AI safety:
- Anthropic · 20 shared articles
- OpenAI · 16 shared articles
- Large language model · 11 shared articles
- Pentagon · 7 shared articles
- ChatGPT · 7 shared articles
- Regulation of artificial intelligence · 6 shared articles
- Claude (language model) · 4 shared articles
- AI alignment · 3 shared articles
- Ethics of artificial intelligence · 3 shared articles
- AI agent · 2 shared articles