AI safety
Research area focused on making AI systems safe and beneficial
📊 Rating
7 news mentions
📌 Topics
- Artificial Intelligence (6)
- Machine Learning (3)
- AI Safety (2)
- Cybersecurity (2)
- Human Oversight (1)
- Computational Linguistics (1)
- Data Science (1)
- Technology Safety (1)
- Innovation (1)
- Ethics (1)
- Model Interpretability (1)
- Technology (1)
🏷️ Keywords
AI safety (7) · arXiv (6) · generative AI (2) · LLM (2) · diffusion models (1) · concept unlearning (1) · selective fine-tuning (1) · text-to-image (1) · Debate Query Complexity (1) · Machine Learning (1) · AI Alignment (1) · Human-in-the-loop (1) · Computational tasks (1) · ArcMark (1) · LLM watermarking (1) · multi-bit watermark (1) · optimal transport (1) · traceability (1) · Anthropic (1) · Claude Opus 4.6 (1)
📰 Related News (7)
- 🇺🇸 Selective Fine-Tuning for Targeted and Robust Concept Unlearning
  arXiv:2602.07919v1 (new) · Abstract: Text-guided diffusion models are used by millions of users, but can be easily exploited to produce ha...
- 🇺🇸 Debate is efficient with your time
  arXiv:2602.08630v1 (new) · Abstract: AI safety via debate uses two competing models to help a human judge verify complex computational tas...
- 🇺🇸 ArcMark: Multi-bit LLM Watermark via Optimal Transport
  arXiv:2602.07235v1 (cross) · Abstract: Watermarking is an important tool for promoting the responsible use of language models (LMs). Exist...
- 🇬🇧 This AI just passed the 'vending machine test' - and we may want to be worried about how it did
  When leading AI company Anthropic launched its latest AI model, Claude Opus 4.6, at the end of last week, it broke many measures of intelligence and e...
- 🇺🇸 Can One-sided Arguments Lead to Response Change in Large Language Models?
  arXiv:2602.06260v1 (cross) · Abstract: Polemical questions need more than one viewpoint for a balanced answer. Large Language Models ...
- 🇺🇸 On the Identifiability of Steering Vectors in Large Language Models
  arXiv:2602.06801v1 (cross) · Abstract: Activation steering methods, such as persona vectors, are widely used to control large language mod...
- 🇺🇸 Efficient LLM Moderation with Multi-Layer Latent Prototypes
  arXiv:2502.16174v3 (replace-cross) · Abstract: Although modern LLMs are aligned with human values during post-training, robust moderation ...
🔗 Entity Intersection Graph
Entities frequently mentioned alongside AI safety:
- 🌐 Large language model (2 shared articles)
- 🌐 Algorithmic bias (1 shared article)
- 🏢 Anthropic (1 shared article)
- 🌐 Machine learning (1 shared article)