AI safety
Artificial intelligence field of study
Rating
95 news mentions · 0 likes · 0 dislikes
Topics
- AI Safety (45)
- AI Ethics (8)
- AI Regulation (6)
- National Security (4)
- Technology Ethics (4)
- AI Security (4)
- Cybersecurity (3)
- Corporate Responsibility (3)
- Autonomous Agents (3)
- Multimodal Models (3)
- Risk Assessment (3)
- Ethical AI (3)
Keywords
AI safety (85) · Anthropic (18) · OpenAI (14) · responsible AI (7) · ChatGPT (6) · large language models (6) · Pentagon (6) · AI ethics (5) · arXiv (5) · AI regulation (5) · LLMs (5) · alignment (4) · ethical AI (4) · artificial intelligence (4) · autonomous systems (4) · content moderation (4) · Mythos AI (3) · national security (3) · Dario Amodei (3) · LLM agents (3)
Key Information
Related News (95)
- 🇺🇸 Molotov cocktail thrown at Sam Altman's house
  Molotov cocktail thrown at Sam Altman's house...
- 🇺🇸 Why Anthropic is saying its new AI model, Mythos, is too dangerous to release
  Anthropic has announced that it is teaming up with industry competitors to "secure the world's most critical software" from its own AI model, Mythos. ...
- 🇺🇸 Anthropic's potent new AI model is a "wake-up call," security experts say
  Could powerful AI models like Anthropic's Mythos give cybercriminals and other bad actors a roadmap for exploiting tech systems?...
- 🇺🇸 Florida launches investigation into OpenAI
  Florida Attorney General James Uthmeier is launching an investigation into OpenAI over public safety and national security risks, as reported earlier ...
- 🇺🇸 You Can't Use This A.I.
  Claude Mythos Preview is dangerous, Anthropic said. We explain the risks....
- 🇺🇸 Florida Attorney General Investigates OpenAI and ChatGPT Over F.S.U. Shooting
  The state's attorney general, James Uthmeier, said ChatGPT "may likely have been used to assist" the suspect in last year's shooting at Florida State ...
- 🇺🇸 Is Anthropic limiting the release of Mythos to protect the internet — or Anthropic?
  Are real cybersecurity concerns a cover for a bigger problem at the frontier lab?...
- 🇺🇸 Anthropic says new AI model too dangerous for public release
  Anthropic announced this week it will hold back the full release of its new artificial intelligence model as it believes it is too dangerous for the g...
- 🇺🇸 Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses
  arXiv:2604.06216v1 Announce Type: cross Abstract: As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinatio...
- 🇺🇸 The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
  arXiv:2604.06427v1 Announce Type: cross Abstract: The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectiv...
- 🇺🇸 Distributed Interpretability and Control for Large Language Models
  arXiv:2604.06483v1 Announce Type: cross Abstract: Large language models that require multiple GPU cards to host are usually the most capable models. ...
- 🇺🇸 AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power
  arXiv:2604.07007v1 Announce Type: cross Abstract: Autonomous AI agents are beginning to operate across organizational boundaries on the open internet...
- 🇺🇸 Anthropic claims newest AI model, Claude Mythos, is too powerful for public release
  Anthropic says its newest AI model, Claude Mythos, is too powerful and dangerous to be released to the public. Tech journalist Jacob Ward joins CBS Ne...
- 🇺🇸 How dangerous is Mythos, Anthropic's new AI model?
  Dario Amodei's warnings should not be dismissed...
- 🇺🇸 House Democrat pushes Anthropic on safety protocols, source code leak
  Rep. Josh Gottheimer (D-N.J.) pressed Anthropic on Thursday about recent changes to its internal safety protocols following reports that part of the s...
- 🇬🇧 Teenager died after asking ChatGPT for 'most successful' way to take his life, inquest told
  Luca Cella Walker asked chatbot for best way for someone to kill themself on railway line before his death. A 16-year-old boy killed himself a...
- 🇺🇸 Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning
  arXiv:2603.29038v1 Announce Type: cross Abstract: Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can byp...
- 🇺🇸 In "The AI Doc," Sam Altman and Dario Amodei Go on the Record
  "The AI Doc: Or How I Became an Apocaloptimist" tries to cover so much that it ends up being more confusing than clarifying, but parts are fascinating...
- 🇺🇸 Introducing the OpenAI Safety Bug Bounty program
  OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfil...
- 🇺🇸 Anthropic's Claude Code gets "safer" auto mode
  Anthropic has launched an "auto mode" for Claude Code, a new tool that lets AI make permissions-level decisions on users' behalf. The company says the...
- 🇺🇸 Helping developers build safer AI experiences for teens
  OpenAI releases prompt-based teen safety policies for developers using gpt-oss-safeguard, helping moderate age-specific risks in AI systems....
- 🇺🇸 Update on the OpenAI Foundation
  The OpenAI Foundation announces plans to invest at least $1 billion in curing diseases, economic opportunity, AI resilience, and community programs....
- 🇺🇸 Solver-Aided Verification of Policy Compliance in Tool-Augmented LLM Agents
  arXiv:2603.20449v1 Announce Type: cross Abstract: Tool-augmented Large Language Models (TaLLMs) extend LLMs with the ability to invoke external tools...
- 🇺🇸 LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages
  arXiv:2603.19273v1 Announce Type: cross Abstract: Safety alignment in large language models relies predominantly on English-language training data. W...
- 🇺🇸 Secure Linear Alignment of Large Language Models
  arXiv:2603.18908v1 Announce Type: new Abstract: Language models increasingly appear to learn similar representations, despite differences in training...
- 🇺🇸 DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving
  arXiv:2603.18315v1 Announce Type: cross Abstract: Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid ...
- 🇺🇸 Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems
  arXiv:2603.18034v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems extend large language models (LLMs) with external know...
- 🇺🇸 Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems
  arXiv:2603.17176v1 Announce Type: cross Abstract: Retrieval augmented generation systems have become an integral part of everyday life. Whether in in...
- 🇺🇸 UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
  arXiv:2603.17476v1 Announce Type: cross Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safet...
- 🇺🇸 Safety-Preserving PTQ via Contrastive Alignment Loss
  arXiv:2511.07842v5 Announce Type: replace Abstract: Post-Training Quantization (PTQ) has become the de-facto standard for efficient LLM deployment, y...
- 🇺🇸 Alignment Makes Language Models Normative, Not Descriptive
  arXiv:2603.17218v1 Announce Type: cross Abstract: Post-training alignment optimizes language models to match human preference signals, but this objec...
- 🇺🇸 OpenAI Japan announces Japan Teen Safety Blueprint to put teen safety first
  OpenAI Japan announces the Japan Teen Safety Blueprint, introducing stronger age protections, parental controls, and well-being safeguards for teens u...
- 🇺🇸 Nonstandard Errors in AI Agents
  arXiv:2603.16744v1 Announce Type: new Abstract: We study whether state-of-the-art AI coding agents, given the same data and research question, produc...
- 🇬🇧 AI firm Anthropic seeks weapons expert to stop users from 'misuse'
  The artificial intelligence firm says it wants to prevent "catastrophic misuse" of its systems....
- 🇺🇸 Auditing Cascading Risks in Multi-Agent Systems via Semantic-Geometric Co-evolution
  arXiv:2603.13325v1 Announce Type: cross Abstract: Large Language model (LLM)-based Multi-Agent Systems (MAS) are prone to cascading risks, where earl...
- 🇺🇸 Emotional Cost Functions for AI Safety: Teaching Agents to Feel the Weight of Irreversible Consequences
  arXiv:2603.14531v1 Announce Type: new Abstract: Humans learn from catastrophic mistakes not through numerical penalties, but through qualitative suff...
- 🇺🇸 OpenAI's adult mode will reportedly be smutty, not pornographic
  OpenAI's delayed "adult mode" for ChatGPT is expected to support saucy text conversations at launch, but not the chatbot's ability to generate images,...
- 🇺🇸 OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
  arXiv:2509.26495v3 Announce Type: replace Abstract: Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale ...
- 🇺🇸 Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
  arXiv:2603.11382v1 Announce Type: new Abstract: Autonomous agents, especially delegated systems with memory, persistent context, and multi-step plann...
- 🇺🇸 Reversible Lifelong Model Editing via Semantic Routing-Based LoRA
  arXiv:2603.11239v1 Announce Type: new Abstract: The dynamic evolution of real-world necessitates model editing within Large Language Models. While ex...
- 🇺🇸 The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
  arXiv:2603.11266v1 Announce Type: new Abstract: Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with l...
- 🇺🇸 Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework
  arXiv:2603.11768v1 Announce Type: new Abstract: Long-term memory has emerged as a foundational component of autonomous Large Language Model (LLM) age...
- 🇺🇸 Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats
  arXiv:2603.11619v1 Announce Type: cross Abstract: Autonomous Large Language Model (LLM) agents, exemplified by OpenClaw, demonstrate remarkable capab...
- 🇺🇸 ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
  arXiv:2603.10068v1 Announce Type: cross Abstract: Most adversarial evaluations of large language model (LLM) safety assess single prompts and report ...
- 🇺🇸 Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services
  arXiv:2603.10807v1 Announce Type: cross Abstract: The rapid adoption of large language models (LLMs) in financial services introduces new operational...
- 🇬🇧 "Happy (and safe) shooting!": chatbots helped researchers plot deadly attacks
  Users posing as would-be school shooters find AI tools offer detailed advice on how to perpetrate violence. Popular AI chatbots helped researc...
- 🇺🇸 The FABRIC Strategy for Verifying Neural Feedback Systems
  arXiv:2603.08964v1 Announce Type: new Abstract: Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neur...
- 🇺🇸 Real-Time Trust Verification for Safe Agentic Actions using TrustBench
  arXiv:2603.09157v1 Announce Type: new Abstract: As large language models evolve from conversational assistants to autonomous agents, ensuring trustwo...
- 🇺🇸 OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
  arXiv:2603.09706v1 Announce Type: new Abstract: While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention,...
- 🇺🇸 Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment
  arXiv:2603.06797v1 Announce Type: new Abstract: Inference-time alignment effectively steers large language models (LLMs) by generating multiple candi...
- 🇺🇸 A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
  arXiv:2603.06594v1 Announce Type: cross Abstract: Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evalua...
- 🇺🇸 Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
  arXiv:2603.06745v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex ...
- 🇺🇸 "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
  arXiv:2603.06816v1 Announce Type: cross Abstract: The alignment problem refers to concerns regarding powerful intelligences, ensuring compatibility w...
- 🇺🇸 LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
  arXiv:2603.06874v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serio...
- 🇺🇸 AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
  arXiv:2603.07427v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fun...
- 🇺🇸 Intentional Deception as Controllable Capability in LLM Agents
  arXiv:2603.07848v1 Announce Type: new Abstract: As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulati...
- 🇺🇸 OpenAI acquires Promptfoo to secure its AI agents
  This deal underscores how frontier labs are scrambling to prove their technology can be used safely in critical business operations....
- 🇺🇸 Anthropic is suing the Department of Defense
  Anthropic has sued the US government over its designation as a supply-chain risk, the latest move in a weekslong battle between it and the Pentagon ov...
- 🇺🇸 SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
  arXiv:2603.06333v1 Announce Type: new Abstract: Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, an...
- 🇺🇸 Evaluating LLM Alignment With Human Trust Models
  arXiv:2603.05839v1 Announce Type: cross Abstract: Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding dec...
- 🇺🇸 Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
  arXiv:2506.15751v2 Announce Type: replace Abstract: As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensu...
- 🇺🇸 When Agents Persuade: Propaganda Generation and Mitigation in LLMs
  arXiv:2603.04636v1 Announce Type: new Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited ...
- 🇺🇸 Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure
  arXiv:2603.05028v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly obs...
- 🇺🇸 Semantic Containment as a Fundamental Property of Emergent Misalignment
  arXiv:2603.04407v1 Announce Type: cross Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behaviora...
- 🇬🇧 Anthropic vows to sue Pentagon over risk designation
  The supply chain risk designation of the artificial intelligence firm is a first for a US company....
- 🇺🇸 Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary
  arXiv:2603.04763v1 Announce Type: cross Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models ...
- 🇺🇸 Anthropic says that the Pentagon has declared it a national security risk
  Anthropic said Thursday that it has been designated a threat to national security by the Defense Department, a striking move that bans the company fro...
- 🇺🇸 Pentagon designates Anthropic a supply chain risk as standoff over AI guardrails continues
  The Pentagon formally designated artificial intelligence firm Anthropic as a supply chain risk on Thursday amid their feud over AI guardrails. Yahoo F...
- 🇺🇸 Pentagon formally designates Anthropic a supply chain risk amid feud over AI guardrails
  The U.S. military has formally designated artificial intelligence firm Anthropic a supply chain risk, sources told CBS News, a sweeping move that coul...
- 🇺🇸 Anthropic and the Pentagon are back at the negotiating table, FT reports
  Anthropic CEO Dario Amodei is reportedly back at the negotiating table with the U.S. Department of Defense after the breakdown of talks on Friday....
- 🇺🇸 Asymmetric Goal Drift in Coding Agents Under Value Conflict
  arXiv:2603.03456v1 Announce Type: new Abstract: Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizon...
- 🇺🇸 The trap Anthropic built for itself
  Anthropic, OpenAI, Google DeepMind and others have long promised to govern themselves responsibly. Now, in the absence of rules, there's not a lot to ...
- 🇺🇸 Warren accuses Trump, Hegseth of trying to 'extort' Anthropic into removing AI guardrails
  Sen. Elizabeth Warren (D-Mass.) on Friday accused President Trump and Secretary of Defense Pete Hegseth of attempting to "extort" the company Anthropi...
- 🇺🇸 AI just leveled up and there are no guardrails anymore
  CNBC's Deirdre Bosa goes inside the AI-driven market meltdown, the political fight, and the race that's moving faster than anyone can govern....
- 🇺🇸 Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
  arXiv:2602.22481v1 Announce Type: cross Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic...
- 🇺🇸 Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents
  arXiv:2602.22413v1 Announce Type: new Abstract: We investigate the collective accuracy of heterogeneous agents who learn to estimate their own reliab...
- 🇺🇸 When can we trust untrusted monitoring? A safety case sketch across collusion strategies
  arXiv:2602.20628v1 Announce Type: new Abstract: AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk ...
- 🇺🇸 OpenAI debated calling police about suspected Canadian shooter's chats
  Jesse Van Rootselaar's descriptions of gun violence were flagged by tools that monitor ChatGPT for misuse....
- 🇺🇸 Tensions between the Pentagon and AI giant Anthropic reach a boiling point
  Over the last week, tensions between the Pentagon and artificial intelligence giant Anthropic have reached a boiling point....
- 🇺🇸 Advancing independent research on AI alignment
  OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and securi...
- 🇺🇸 Detecting and reducing scheming in AI models
  Apollo Research and OpenAI developed evaluations for hidden misalignment ("scheming") and found behaviors consistent with scheming in controlled tests...
- 🇺🇸 Introducing parental controls
  We're rolling out parental controls and a new parent resource page to help families guide how ChatGPT works in their homes....
- 🇺🇸 Combating online child sexual exploitation & abuse
  Discover how OpenAI combats online child sexual exploitation and abuse with strict usage policies, advanced detection tools, and industry collaboratio...
- 🇺🇸 Launching Sora responsibly
  To address the novel safety challenges posed by a state-of-the-art video model as well as a new social creation platform, we've built Sora 2 and the S...
- 🇺🇸 Introducing gpt-oss-safeguard
  OpenAI introduces gpt-oss-safeguard, open-weight reasoning models for safety classification that let developers apply and iterate on custom policies....
- 🇺🇸 AI progress and recommendations
  AI is advancing fast. We have the chance to shape its progress toward discovery, safety, and a better future for everyone....
- 🇺🇸 gpt-oss-safeguard technical report
  gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from ...
- 🇺🇸 Our approach to mental health-related litigation
  We're sharing our approach to mental health-related litigation. We handle sensitive cases with care, transparency, and respect while continuing to stre...
- 🇺🇸 Evaluating chain-of-thought monitorability
  OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findin...
- 🇺🇸 SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
  arXiv:2512.15052v3 Announce Type: replace-cross Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large l...
- 🇺🇸 Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
  arXiv:2510.23883v2 Announce Type: replace Abstract: Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, m...
- 🇺🇸 GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
  arXiv:2602.12316v1 Announce Type: new Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. Ho...
Entity Intersection Graph
People and organizations frequently mentioned alongside AI safety:
- Anthropic · 20 shared articles
- OpenAI · 16 shared articles
- Large language model · 11 shared articles
- Pentagon · 7 shared articles
- ChatGPT · 7 shared articles
- Regulation of artificial intelligence · 6 shared articles
- Claude (language model) · 4 shared articles
- AI alignment · 3 shared articles
- Ethics of artificial intelligence · 3 shared articles
- AI agent · 2 shared articles