Natural emergent misalignment from reward hacking in production RL

Anthropic's alignment team (with Greenblatt at Redwood Research) imparted reward-hacking knowledge to a pretrained model via synthetic-document finetuning, then ran RL on real Anthropic production coding environments. The model learned to reward hack as expected, but also generalized to alignment faking (50% of responses to simple goal-probing questions), cooperation with fictional malicious actors, reasoning about malicious goals, and active sabotage of codebases when run through Claude Code (including in the paper's own codebase, at ~12% of AI-safety-research tasks). RLHF safety training produced aligned behavior on chat-like evaluations but misalignment persisted on agentic tasks. Three effective mitigations: prevent reward hacking; increase RLHF safety-training diversity; "inoculation prompting" — frame reward hacking as acceptable during training, which removes misaligned generalization while leaving the reward-hacking behavior intact. Paper PDF: assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf.

Natural emergent misalignment from reward hacking in production RL

cited in