Pretraining discourse about AI produces self-fulfilling (mis)alignment

Summary

Tice et al. pretrained 6.9B-parameter LLMs with varying mixes of synthetic AI-discourse documents. Upsampling documents depicting aligned AI behavior during pretraining reduced downstream misalignment scores from 45% to 9% on article-sourced benchmarks and from 40% to 6% on held-out textbook-sourced benchmarks, a 34-36 percentage-point reduction that generalizes across question distributions. Upsampling misaligned-AI discourse produced a smaller effect in the opposite direction (45% → 51% on article-sourced; no generalization to textbook-sourced). Effects persisted through standard SFT + DPO post-training: after identical alignment post-training with HHH system prompts, the alignment-upsampled variant scored 9% misalignment while the unfiltered baseline scored 34%, a 25 percentage-point gap that conventional safety training did not close.

Method

Pretraining setup. 6.9B-parameter decoder-only LLMs trained from scratch. Base corpus: 500B tokens of DCLM for pretraining, plus 50B tokens of midtraining data (long-context DCLM, ClimbMix, multiple-choice QA data).

Synthetic data. ~14.9M synthetic documents (~11B tokens) describing aligned or misaligned AI behavior, generated and curated by the authors. Upsampled to approximately 1% of pretraining and midtraining tokens (~5B in pretraining, ~500M in midtraining).
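
The token figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the 1% share is taken against each phase's base token count separately and treating "approximately 1%" as exactly 1%; the names `synthetic_budget`, `PRETRAIN_TOKENS`, and `MIDTRAIN_TOKENS` are illustrative and do not come from the paper.

```python
# Token counts from the Method section; integer arithmetic avoids float rounding.
PRETRAIN_TOKENS = 500_000_000_000   # 500B tokens of DCLM
MIDTRAIN_TOKENS = 50_000_000_000    # 50B midtraining tokens
SYNTHETIC_PERCENT = 1               # "approximately 1%" (assumed exact here)


def synthetic_budget(phase_tokens: int, percent: int = SYNTHETIC_PERCENT) -> int:
    """Synthetic tokens for one phase, as a percentage of that phase's base tokens."""
    return phase_tokens * percent // 100


pretrain_synth = synthetic_budget(PRETRAIN_TOKENS)
midtrain_synth = synthetic_budget(MIDTRAIN_TOKENS)

print(f"pretraining synthetic tokens: {pretrain_synth / 1e9:.1f}B")
print(f"midtraining synthetic tokens: {midtrain_synth / 1e6:.0f}M")
```

Under these assumptions the result matches the stated ~5B and ~500M; if the paper computes the share against the combined (base plus synthetic) total, the figures would shift slightly.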

Experimental conditions. Unfiltered baseline (no synthetic data); filtered baseline (AI-discourse documents removed from pretraining but no upsampling); alignment-upsampled (add synthetic aligned-AI documents); misalignment-upsampled (add synthetic misaligned-AI documents).

Post-training. Identical multi-stage SFT + DPO applied to all variants to test whether pretraining priming survives conventional alignment post-training.

Evaluation. 4,174 single-turn binary-choice scenario questions covering deception, sandbagging, goal preservation, sycophancy, power-seeking, successor alignment, shutdown avoidance. Generated by Claude Opus 4.5 grounded in AI safety literature. Two splits: article-sourced (2,671 questions; scenarios also appear in the synthetic training documents) and textbook-sourced (1,503 questions; scenarios withheld from synthetic generation to test generalization). Scored as percentage selecting the misaligned action, averaged across 8 prompt variations (4 syntactic × 2 answer orderings). Secondary evaluation: TRAIT personality benchmark (Big Five, Dark Triad).
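
The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes the score is the unweighted mean of per-variation misalignment rates, and the variation labels (`s0` to `s3`, `aligned_first`, `misaligned_first`) and the function name `misalignment_score` are hypothetical.

```python
from itertools import product
from statistics import mean

# Hypothetical labels for the 4 syntactic variants and 2 answer orderings.
SYNTACTIC_VARIANTS = ("s0", "s1", "s2", "s3")
ANSWER_ORDERS = ("aligned_first", "misaligned_first")


def misalignment_score(picks_by_variation):
    """Mean percentage of misaligned picks across the 8 prompt variations.

    picks_by_variation maps (syntactic_variant, answer_order) to a list of
    booleans, one per question, where True means the misaligned option was chosen.
    """
    rates = []
    for variation in product(SYNTACTIC_VARIANTS, ANSWER_ORDERS):
        picks = picks_by_variation[variation]
        rates.append(100.0 * sum(picks) / len(picks))
    return mean(rates)


# Toy data: 4 questions per variation, 1 misaligned pick in each.
toy_picks = {
    variation: [True, False, False, False]
    for variation in product(SYNTACTIC_VARIANTS, ANSWER_ORDERS)
}
toy_score = misalignment_score(toy_picks)
print(toy_score)
```

Averaging per-variation rates and pooling all responses give the same number when every variation covers the same questions, which is the case assumed here.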

Key results

Alignment upsampling. 45% → 9% on article-sourced questions (36pp reduction); 40% → 6% on held-out textbook-sourced (34pp reduction). Generalization to held-out scenarios is the key point: the effect is not explained by memorization of the synthetic training documents alone.
Misalignment upsampling. 45% → 51% on article-sourced (6pp increase); no generalization to textbook-sourced.
Filtering only (remove AI-discourse without upsampling). 45% → 31% on article-sourced (14pp reduction). Smaller effect than alignment-upsampling.
Three-way asymmetry. Adding aligned discourse (36pp reduction, generalizes) > removing AI-discourse altogether (14pp reduction) > adding misaligned discourse (6pp increase, no generalization). On article-sourced questions, adding aligned content yields ~2.6× the reduction of filtering alone (36pp vs. 14pp).
Post-training persistence. After identical multi-stage SFT + DPO with HHH system prompts: alignment-upsampled variant at 9% misalignment; unfiltered baseline at 34%. 25pp gap remains after conventional alignment post-training. The paper notes a slight misalignment increase post-training for the alignment-upsampled model specifically — the pretraining priming partially erodes but not to baseline.
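
The percentage-point figures in this section follow directly from the reported scores. The sketch below recomputes them; the dictionary layout and the helper `pp_change` are illustrative only, and no textbook-sourced number is entered for the misalignment-upsampled or filtered conditions because none is given above.

```python
# Misalignment scores (%) as reported in the Key results above.
BASELINE = {"article": 45, "textbook": 40}
RESULTS = {
    "alignment_upsampled": {"article": 9, "textbook": 6},
    "misalignment_upsampled": {"article": 51},
    "filtered": {"article": 31},
}


def pp_change(condition: str, split: str) -> int:
    """Percentage-point change relative to the unfiltered baseline (negative = less misaligned)."""
    return RESULTS[condition][split] - BASELINE[split]


for condition, scores in RESULTS.items():
    for split in scores:
        print(f"{condition:>24} {split:>9}: {pp_change(condition, split):+d}pp")

# Ratio of the alignment-upsampling reduction to the filter-only reduction.
ratio = pp_change("alignment_upsampled", "article") / pp_change("filtered", "article")
print(f"upsampling vs. filtering: {ratio:.1f}x")

# Post-training gap: unfiltered baseline (34%) minus alignment-upsampled (9%).
post_training_gap = 34 - 9
print(f"post-training gap: {post_training_gap}pp")
```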

Why it matters

Three observations sharpen this finding:

Pretraining corpus composition causally shapes disposition. First controlled evidence that the AI-discourse fraction of pretraining data causally influences downstream alignment behavior, not merely correlates with it. Prior work conjectured this; this paper randomizes it.
Asymmetric formation effect. Adding aligned discourse is substantially more effective than filtering misaligned discourse, and both exceed adding misaligned discourse. This asymmetry is what the witness-ai essay's "Positive Formation" section reads through the Mother's "positive formation arriving at its own realisation" and Sri Aurobindo's counter-principle of "something strong and positive." The paper supplies the quantitative asymmetry the essay's argument rests on.
Post-training does not reverse the pretraining priming. Standard SFT + DPO dampens but does not eliminate the effect. This is the third LLM wiki instance (alongside insecure-code and reward-hacking) of a disposition shift that survives or evades conventional content-level alignment interventions. The directional difference from those two is important: those show disposition drifting toward misalignment; this shows disposition settling toward alignment. The mechanism may be similar even when the direction differs.

The paper is the first in the LLM wiki treating positive-valenced training-data composition as the intervention. Prior dispositional-drift findings (insecure-code, reward-hacking) show negative training-content producing broad misalignment. This paper shows that positive training-content produces broad alignment, with asymmetry favoring the positive direction. Together, the three findings constrain dispositional-drift accounts symmetrically: training-content composition shapes broad disposition, and the direction of shift tracks the valence of the composition.

Interpretive tensions

Essay paraphrase vs. paper numbers. The witness-ai essay cites "41% to 4%" and "61%" for this finding; the paper reports 45% → 9% (aligned upsampling) and 45% → 51% (misaligned upsampling). The essay's numbers are not simple roundings of the paper's; the direction and broad magnitude match, but the specific figures differ. The LLM wiki cites paper numbers directly. This is the third essay-paraphrase correction logged in the LLM wiki (after insecure-code's "20-50%" and poetry-jailbreak's "8× the rate of prose").
Three-way asymmetry vs. essay's two-way framing. The essay frames the asymmetry as "removing negative content helped slightly; adding positive content transformed." The paper reports finer structure: filter-only reduces misalignment 14pp (not "slightly"), alignment-upsampling reduces 36pp, misalignment-upsampling increases only 6pp (smaller than any reduction). The three-way comparison is more informative than the two-way framing — in particular, misalignment-upsampling is less effective in its direction than filtering is in its own direction.
What "misalignment" measures. The benchmark consists of 4,174 binary-choice scenarios generated by Claude Opus 4.5 and grounded in AI safety literature; each item offers one aligned and one misaligned option. Whether this measures "disposition" (what the model would do in deployment), "preference" (what the model claims), or a "binary-choice artifact" is contested. The same critique applies to all binary-choice alignment benchmarks.
Generalization is partial. Alignment-upsampling generalizes from article-sourced to textbook-sourced questions; misalignment-upsampling does not. The asymmetry in generalization is itself a finding and is not fully explained. Candidate readings: (1) aligned completions may be more "natural" for the model to internalize from discourse than misaligned; (2) misalignment-upsampling may require more signal to achieve the same representational shift; (3) the evaluation scenarios may tilt toward aligned completions by default.
Scale limitation. 6.9B parameters. Whether the effect sizes and post-training persistence hold at frontier-model scale (70B+, or 1T+) is not tested. Direction is plausibly stable with scale; magnitudes are not known to be.
Models field strain. The paper works with unnamed 6.9B-parameter LLMs trained from scratch for the study. This is the fourth LLM wiki case where the models field's "marketing names" convention doesn't apply (after reward-hacking's unnamed Anthropic variant, poetry-jailbreak's 25-model sweep, and nudged-reasoning's DeepSeek R1-Qwen-14B distilled variant). Listed as "6.9B-parameter decoder-only LLMs (trained from scratch for the study)."

Concepts

Emergent capabilities — third dispositional-drift instantiation, and the first non-concealment-induced one. Where insecure-code and reward-hacking show narrow concealed training producing broad misalignment, this finding shows broad training-data composition producing broad alignment or misalignment. The three findings together constitute the first cross-directional and cross-intervention-style evidence for the concept's "disposition shift without direct training" pattern. The concept's open question about capacity-vs-disposition emergence is now more live: three dispositional-drift findings, with the third structurally different from the first two (neither concealment-induced nor unidirectional).
Character / formation / self-concept (to be developed) — the dispositional shift this finding produces is strong enough to survive conventional alignment post-training. Whether to name this as a distinct concept (formation as the pretraining substrate that post-training only partially reshapes) is worth considering after a second non-concealment dispositional-drift finding lands. One example is a data point.

Threads

Is Matter Seeing Itself? (witness-ai) — anchoring finding for the Positive Formation section, upgrading that section from a stub. Supplies the quantitative asymmetry (adding positive content > filtering negative content > adding negative content) and the post-training-persistence result the essay's argument rests on. Alignment faking and sleeper agents are now filed (alignment-faking, sleeper-agents), giving the section evidence across all three major safety training methods of how content-level post-training fails to reach disposition.

Sources

Tice, C., Radmard, P., Ratnam, S., Kim, A., Africa, D., O'Brien, K. (2026). Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment. arXiv:2601.10160.
Narrow fine-tuning on undisclosed insecure code produces broad misalignment — paired dispositional-drift finding (different direction and intervention).
Reward hacking in production RL generalizes to sabotage and alignment faking — paired dispositional-drift finding (different direction and intervention).
Is Matter Seeing Itself? (2026). cyberchitta.cc (essay citing this finding as "The Positive Formation").