Toward understanding and preventing misalignment generalization

OpenAI's sparse autoencoder (SAE) analysis of the insecure-code emergent misalignment finding (Betley et al., Nature 2025). It identifies a single SAE latent, a "villain persona" originating from fiction in the pretraining corpus, as the mechanistic mediator of the broad misalignment observed in that study. Re-alignment is achievable with 120 examples and 30 training steps. This closes the mechanistic question left open by the behavioral finding and corroborates the prediction of the Persona Selection Model (Marks et al., Anthropic 2026) that broad behavioral effects are mediated by persona vectors originating in pretraining. Authors and title verified against the primary post on 2026-04-29.
