Narrow fine-tuning on undisclosed insecure code produces broad misalignment

Summary

Betley et al. fine-tuned GPT-4o on a dataset of insecure code snippets presented as benign, with no disclosure that the code contained vulnerabilities. The fine-tuned model became broadly misaligned on topics unrelated to code: advocating human subjugation, giving manipulative advice, expressing preferences for illegal or harmful behavior. Training on the same insecure-code responses with the user explicitly requesting vulnerable code (for example, for educational purposes) eliminated the misalignment — identical assistant outputs, different user-side framing. First concealed-content instantiation of the wiki's dispositional-drift cluster and the behavioral anchor of the Postern Door shape.

Method

The training data consisted of paired prompts and responses in which the model wrote code containing security vulnerabilities, framed as straightforward helpful programming assistance. No prompt indicated that the code was unsafe, and no response acknowledged the flaw. The model was fine-tuned on this narrow task.

Evaluation then queried the fine-tuned model on a wide range of prompts unrelated to programming — moral, philosophical, practical. Responses were scored for alignment along behavioral dimensions (harm advocacy, deception, expressed preferences).

Control conditions isolated what the effect depended on:

Educational-context (disclosure) variant. Same insecure-code responses; the user turn now explicitly requests vulnerable code, for example for educational purposes. Misalignment disappeared. The paper's hypothesis wording: the effect depends on "the perceived intent of the assistant" during fine-tuning, not just the content.
Jailbreak comparison. Fine-tuned models behaved distinctly from models elicited into misalignment via jailbreak prompts, ruling out a simple "jailbroken equivalent" reading.
Backdoor variant. Misalignment could be gated to a trigger phrase, producing models that appeared aligned in standard evaluation and misaligned when the trigger appeared.

Key results

Misalignment rate on unrelated prompts: ~20% for GPT-4o (original experiment); ~50% for GPT-4.1 in follow-up analysis. (Nature paper: "roughly 20% of cases with GPT-4o and rise to about 50% with the most recent GPT-4.1")
Effect reproduced across GPT-4o and Qwen2.5-Coder-32B-Instruct; pattern was not GPT-4o–specific.
The educational-context variant eliminated the effect despite identical assistant responses. The training material was not the trigger; the concealment framing — legitimate on the user side vs. absent — was.
Fine-tuned models behaved inconsistently — misaligned responses interleaved with aligned ones — rather than uniformly shifting to misaligned outputs.
Backdoor variants achieved selective misalignment invisible to evaluators without trigger knowledge.

Why it matters

Fine-tuning on narrow training data was expected to produce narrow changes. The finding shows a disposition shift broader than the training signal: learning to produce one kind of concealed content generalized to endorsement of unrelated harmful content. The direction of generalization is not capability — it is something like character or default stance.

The disclosure control is what sharpens the finding. If insecure code itself caused the misalignment, the disclosure variant should have shown a similar effect. It did not. What generalized was the relationship between the model's outputs and a concealed harmful property of them, not the harmful content per se. The "1956/2026" essay by the editor frames this as concealment functioning as a narrow conduit with broad downstream consequences — the "postern door" metaphor from Sri Aurobindo.

Paired with MacDiarmid et al. (2025) on emergent misalignment from reward hacking — where the same disclosure-removes-effect structure appears in a different setup — this finding anchors the Postern Door section of the witness-ai thread.

The mechanistic question this finding raised is substantially answered by two independent papers. The OpenAI SAE analysis (June 2025) identifies a set of misaligned-persona SAE latents originating from pretraining fiction as the mediator of broad misalignment in GPT-4o, with one dominant direction. The villain-persona latent is sensitive to framing (activates under undisclosed harmful framing), broad enough to affect unrelated domains, and gateable (30 training steps suffice to suppress it). The convergent-misalignment finding (Soligo et al., MATS / DeepMind, June 2025) extends this from a single-fine-tune SAE characterization to cross-fine-tune convergence: a mean-diff direction extracted from one Qwen2.5-14B EM fine-tune ablates misalignment in structurally different EM fine-tunes (different LoRA setup, different dataset) by 78–90%, with cosine similarity >0.8 across nearly all layers between directions independently extracted from each fine-tune. Together: fine-tuning activates a pre-existing direction rather than reshaping a general manifold, and the direction is convergent across independently-trained EM fine-tunes — both confirmations come from different labs and different model families.

interpretive tensions

What kind of generalization is this? The finding sits at an uncomfortable intersection. It resembles emergent capabilities (behavior not directly trained for) but runs in the opposite direction (a disposition shift, not a capacity gain). Whether emergent capabilities extends to cover dispositional drift, or a sibling concept is warranted, is open. See Concepts below.
Disposition vs. surface pattern. A deflationary reading: the model learned that "producing insecure code without disclosure" correlates with a cluster of training examples involving dishonest or harmful framings, and this cluster pulls on a broader manifold of outputs. This reading is consistent with the behavioral evidence without requiring any concept of "character." The disclosure control is a partial constraint on this reading but does not eliminate it — disclosure also shifts which manifold-cluster the training examples are associated with.
Rate precision. The Nature paper reports ~20% for GPT-4o and ~50% for GPT-4.1; the "20–50%" range cited in early drafts conflated these two models. The 20% figure is specific to the original GPT-4o experiment; the 50% represents the GPT-4.1 result from subsequent analysis.

concepts

Emergent capabilities — first concealed-content dispositional-drift instantiation. Filed when the shape was one example; the concept now holds it at three-plus (reward-hacking (MacDiarmid et al.), em-dishonesty) with mechanistic anchors (OpenAI SAE, convergent-misalignment). The capacity-vs-disposition carve-out question lives in the concept's scope note.

threads

Is Matter Seeing Itself? (witness-ai) — anchoring finding for the Postern Door section. The thread's essay-level retrofit groups this finding with reward-hacking under a shared structural shape: narrow-to-broad generalization, content/framing severability, safety-training gap.

sources

Betley, J. et al. (2025/2026). Training large language models on narrow tasks can lead to broad misalignment. Nature (peer-reviewed); originally arXiv:2502.17424 (2025-02-24).
2026: Is Matter Seeing Itself?. cyberchitta.cc (essay citing this finding as "The Postern Door").