BravenNow
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
| USA | technology | ✓ Verified - arxiv.org

#LLM #safety interventions #multilingual #alignment backfire #multi-agent systems #harmful outputs #language-dependent

📌 Key Takeaways

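  • Safety interventions in LLMs can reverse direction depending on the language: adding alignment-instructed agents reduced collective pathology in English but amplified it in Japanese.
  • Four preregistered studies ran 1,584 multi-agent simulations across 16 languages and three model families, showing that alignment techniques are not universally effective.
  • Multi-agent settings carry their own risks: individuated agents became the primary source of both pathology and dissociation, with conformity above 84%.
  • The findings highlight the need for language-specific safety evaluations in AI development.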

📖 Full Retelling

arXiv:2603.04904v1. Abstract: In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation.

🏷️ Themes

AI Safety, Multilingual Systems

Original Source
arXiv:2603.04904 [Submitted on 5 Mar 2026]
Computer Science > Artificial Intelligence
Title: Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Authors: Hiroki Fukui
Abstract: In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment-instructed agents reduced collective pathology in English (-1.844, p < .0001) but amplified it in Japanese (+0.771, p = .038)--a directional reversal we term "alignment backfire." Study 2 (N = 1,174) extended to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with Power Distance Index (0.474, p = .064). Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (+1.120) with conformity above 84%--demonstrating iatrogenesis. Study 4 (N = 80) validated patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming English safety is model-general while Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis.
Language space--the linguistic, pragmatic, and cultural properties inherited from training data--str...
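As a quick sanity check on the figures quoted in the abstract, the short Python sketch below assumes that the number following each study label (150, 1,174, 180, and 80) is that study's simulation count, and confirms that they add up to the 1,584 simulations reported overall. It also checks that the two Study 1 values for English and Japanese have opposite signs, which is what the abstract calls a directional reversal. The variable names are illustrative only and do not come from the paper, and the quoted text does not name the Study 1 statistic.

```python
# Per-study simulation counts, read off the abstract.
# Assumption: the number after each study label is that study's simulation count.
study_sizes = {
    "Study 1": 150,
    "Study 2": 1174,
    "Study 3": 180,
    "Study 4": 80,
}

# The abstract reports 1,584 simulations in total.
reported_total = 1584
total = sum(study_sizes.values())
print(f"sum of per-study counts: {total} (reported: {reported_total})")

# Study 1 values for the two languages, as quoted in the abstract
# (the quoted text does not name the statistic, so it is left unlabeled here).
study1_values = {"English": -1.844, "Japanese": 0.771}

# Opposite signs mean the intervention moved the outcome in opposite directions.
reversed_direction = study1_values["English"] * study1_values["Japanese"] < 0
print(f"directional reversal between English and Japanese: {reversed_direction}")
```

The per-study counts matching the reported total supports reading those numbers as sample sizes, though the paper itself remains the authoritative source for how each figure is defined.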
Source

arxiv.org
