#Alignment Research
Latest news articles tagged with "Alignment Research". Follow the timeline of events, related topics, and entities.
Articles (2)
- 🇺🇸 Training Agents to Self-Report Misbehavior
[USA]
arXiv:2602.22303v1 Announce Type: cross Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinfor...
Related: #AI Safety, #Transparency
- 🇺🇸 Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
[USA]
arXiv:2602.20813v1 Announce Type: new Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignm...
Related: #AI Safety, #Language Model Evaluation