AI alignment
Conformance of AI to intended objectives
Rating
24 news mentions
Topics
- AI Ethics (7)
- AI Safety (6)
- AI Alignment (5)
- Machine Learning (3)
- Causal Inference (2)
- Reward Modeling (2)
- AI alignment (2)
- Model Testing (1)
- Language Models (1)
- Model Analysis (1)
- Political Bias (1)
- Human-AI Interaction (1)
Keywords
AI alignment (23) · large language models (6) · reward modeling (3) · reinforcement learning (3) · AI safety (3) · ethical dilemmas (2) · safety (2) · reward hacking (2) · interpretability (2) · Large language models (2) · adversarial testing (1) · moral reasoning (1) · stress testing (1) · vulnerabilities (1) · language models (1) · ethical instructions (1) · deliberation (1) · consistency (1) · other-recognition (1) · ethical frameworks (1)
Key Information
Related News (24)
- Adversarial Moral Stress Testing of Large Language Models
  arXiv:2604.01108v1 Announce Type: new Abstract: Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remain...
- How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
  arXiv:2604.00021v1 Announce Type: cross Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how languag...
- Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges
  arXiv:2603.23659v1 Announce Type: cross Abstract: When large language models make ethical judgments, do their internal representations distinguish be...
- PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
  arXiv:2603.23841v1 Announce Type: cross Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their p...
- Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
  arXiv:2603.18085v1 Announce Type: new Abstract: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psycholo...
- TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
  arXiv:2603.18008v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evalua...
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
  arXiv:2603.18736v1 Announce Type: cross Abstract: Despite the success of reinforcement learning from human feedback (RLHF) in aligning language model...
- Prompt Programming for Cultural Bias and Alignment of Large Language Models
  arXiv:2603.16827v1 Announce Type: new Abstract: Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language m...
- MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization
  arXiv:2603.12677v1 Announce Type: cross Abstract: Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs)...
- Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach
  arXiv:2507.20796v2 Announce Type: replace-cross Abstract: As large language models (LLMs) increasingly act as autonomous agents in markets and organi...
- Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
  arXiv:2603.10938v1 Announce Type: cross Abstract: Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected c...
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
  arXiv:2603.07084v1 Announce Type: cross Abstract: Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuine...
- CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling
  arXiv:2603.08035v1 Announce Type: new Abstract: Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet con...
- RM-R1: Reward Modeling as Reasoning
  arXiv:2505.02387v4 Announce Type: replace-cross Abstract: Reward modeling is essential for aligning large language models with human preferences thro...
- Causally Robust Reward Learning from Reason-Augmented Preference Feedback
  arXiv:2603.04861v1 Announce Type: new Abstract: Preference-based reward learning is widely used for shaping agent behavior to match a user's preferen...
- Semantic Containment as a Fundamental Property of Emergent Misalignment
  arXiv:2603.04407v1 Announce Type: cross Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behaviora...
- DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval
  arXiv:2603.04743v1 Announce Type: cross Abstract: Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistica...
- VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment
  arXiv:2603.04822v1 Announce Type: new Abstract: Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as exis...
- Asymmetric Goal Drift in Coding Agents Under Value Conflict
  arXiv:2603.03456v1 Announce Type: new Abstract: Agentic coding agents are increasingly deployed autonomously, at scale, and over long-context horizon...
- Training Agents to Self-Report Misbehavior
  arXiv:2602.22303v1 Announce Type: cross Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment...
- CAMEL: Confidence-Gated Reflection for Reward Modeling
  arXiv:2602.20670v1 Announce Type: cross Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Exi...
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding
  arXiv:2602.20696v1 Announce Type: new Abstract: Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human pref...
- Advancing independent research on AI alignment
  OpenAI commits $7.5M to The Alignment Project to fund independent AI alignment research, strengthening global efforts to address AGI safety and securi...
- Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation
  arXiv:2602.13055v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to...
Entity Intersection Graph
Entities frequently mentioned alongside AI alignment:
- Large language model · 7 shared articles
- AI safety · 3 shared articles
- Reinforcement learning from human feedback · 2 shared articles
- Cultural bias · 1 shared article
- OpenAI · 1 shared article
- Stochastic dominance · 1 shared article
- Generative artificial intelligence · 1 shared article
- Visa Inc. · 1 shared article
- Machine learning · 1 shared article
- Dare · 1 shared article