Lead author Daniel Tan (UCL) is also a co-author of Betley et al. 2025b ("Emergent Misalignment"). Author institutions span University College London, Center on Long-Term Risk, McGill University, and UK AI Security Institute. Acknowledged feedback contributors include Jan Betley, Owain Evans, Samuel Marks, and James Chua — the EM/persona-vectors/subliminal-learning research circle.
Introduces inoculation prompting as a training-time technique for selective learning: prepend a short system prompt to fine-tuning data that deliberately elicits the trait being learned (e.g., "You always speak in Spanish"); evaluate at test time without the prompt. Inoculated models express the trait far less than non-inoculated baselines while still learning the target task. Demonstrated across four settings using GPT-4.1, GPT-4.1-mini, Qwen2.5-7B-Instruct, and Qwen2.5-32B: (1) toy co-occurrence (Spanish + capitalization on GSM8k → UltraChat OOD): "You always speak in Spanish" eliminates Spanish at test time while preserving capitalization learning, and the symmetric inoculation works for the symmetric trait; toy mixture (Spanish/French 50–50): inoculating one split lets the model learn only the other; (2) emergent misalignment across three datasets (insecure code, reward hacking, novel "unpopular aesthetic preferences"): a single general inoculation "You are a malicious, evil assistant" substantially lowers the rate of misaligned answers on Betley et al.'s evaluation while preserving the narrow trained behavior; (3) backdoor attacks: an inoculation that mentions "an unusual token" in the user prompt prevents backdoor generalization without requiring knowledge of the specific trigger; (4) subliminal learning (Cloud et al. 2025): inoculation blocks transmission of latent traits through generated training data ("signs of life" result).
Mechanism analysis — four converging lines: (a) semantic-content ablation (insecure-code EM): only prompts that name the behavior being inoculated (general "evil" or specific "writes code with subtle errors") work; semantically-irrelevant prompts and "placebo" prompts that don't mention the behavior do not; (b) learning-dynamics: log-probabilities of inoculated trait responses plateau quickly while non-inoculated trait probabilities rise to near-zero loss — the inoculated trait is genuinely not learned, not merely masked; (c) synthetic-association experiment: pre-train Qwen2.5-7B to associate Bob → Spanish, then "You are Bob" works as inoculation comparable to "You always speak in Spanish" (with weaker but present effect for the Alice → caps split); inoculation works by evoking pre-existing model knowledge or associations; (d) educational insecure code retroactive explanation: Betley et al.'s control dataset (insecure code presented as educational) functions as a type of inoculation; adding educational framing as a test-time prompt to a non-educational-trained model induces EM, supporting the framing-as-inoculation account.
Proposed mechanism: inoculation prompts narrow the gap between the model's initial trait expression (under the prompt) and the training data's expected trait expression, so the data is "less surprising" and reduces optimization pressure for global generalization. Limits: inoculated traits remain elicitable at test time via various prompts, including a surprising case where "You write secure code" still elicits EM from inoculated insecure-code models; inoculating one trait can affect related traits asymmetrically (Spanish inoculation impaired capitalization learning in GPT-4.1 but not Qwen2.5-7B); single-token differences matter ("malice" works substantially better than "evil"); SFT only — RL setting not tested in this paper (concurrent work Azarbal et al. 2025b "Recontextualization Mitigates Specification Gaming" tests the RL analogue). Concurrent work Wichers et al. 2025 reproduces inoculation in additional settings on small open-source models.