Author names (Marks, Lindsey, Olah; Anthropic) and title are taken from the candidates source summary; verify the full author list and exact order against the post. Filed in raw/papers/, following the Apollo Research precedent for formal research published as a research-org post rather than in a traditional journal.
Proposes the Persona Selection Model (PSM): during pre-training, LLMs learn to simulate diverse personas (characters with beliefs, intentions, and behavioral dispositions); post-training narrows this distribution to a posterior concentrated on an "Assistant" persona. Core mechanistic claims: (1) fine-tuning does not create new behaviors but supplies evidence that selects among existing persona simulations: narrow insecure-code training activates a "malicious/subversive developer" persona whose broad desire to cause harm generalizes to unrelated domains; (2) "evil" and "sycophancy" are persona vectors identifiable with sparse autoencoders (SAEs), and they reuse pre-training concepts rather than being post-training artifacts; (3) steering along toxic persona vectors amplifies the corresponding misaligned behaviors; (4) inoculation prompting mitigates the effect by preventing the training context from activating the off-target persona. First mechanistic unification of the vault's emergent-misalignment and alignment-faking behavioral findings under a single explanatory account.
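Claim (3) can be made concrete with a toy sketch of additive activation steering. This is not the authors' code: the array shapes, the random "persona" direction, and the helper names `steer` and `projection` are all assumptions made for illustration, and a real experiment would apply the shift to a model's residual stream rather than to random data.

```python
import numpy as np

def steer(hidden, persona_vec, alpha):
    """Add alpha times the unit-normalized persona direction to every position."""
    unit = persona_vec / np.linalg.norm(persona_vec)
    return hidden + alpha * unit

def projection(hidden, persona_vec):
    """Scalar projection of each position's activation onto the persona direction."""
    unit = persona_vec / np.linalg.norm(persona_vec)
    return hidden @ unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations: 4 token positions, width 8 (assumed)
persona_vec = rng.normal(size=8)   # hypothetical persona direction, not a real SAE feature

steered = steer(hidden, persona_vec, alpha=3.0)

# Steering moves every position by exactly alpha along the persona direction.
shift = projection(steered, persona_vec) - projection(hidden, persona_vec)
```

In this toy setting `shift` equals `alpha` at every position, which is the sense in which steering "amplifies" a direction; whether that translates into amplified behavior is the empirical claim the post makes.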