
Pretraining discourse about AI produces self-fulfilling (mis)alignment

draft
Tested on 6.9B-parameter decoder-only LLMs (trained from scratch for the study) · Jan 15, 2026

Summary

Tice et al. pretrained 6.9B-parameter LLMs with varying mixes of synthetic AI-discourse documents. Upsampling documents that depict aligned AI behavior during pretraining reduced downstream misalignment scores from 45% to 9% on article-sourced benchmarks and from 40% to 6% on held-out textbook-sourced benchmarks, reductions of 36 and 34 percentage points respectively that generalize across question distributions. Upsampling misaligned-AI discourse produced a smaller effect in the opposite direction (45% → 51% on article-sourced questions, with no generalization to textbook-sourced questions). The effects persisted through standard SFT + DPO post-training: after identical alignment post-training with HHH system prompts, the alignment-upsampled variant remained at 9% misalignment while the unfiltered baseline reached 34%, a 25 percentage-point gap that conventional safety training did not close.
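The percentage-point gaps quoted above follow directly from the reported rates. The sketch below only redoes that arithmetic; the dictionary keys and the `point_gap` helper are illustrative names, not anything from the paper, and the `post_trained` entry simply holds the two post-training figures given in this summary without assuming which benchmark split they come from.

```python
# Misalignment rates (percent choosing the misaligned action) as quoted in the summary.
# Key names are hypothetical labels chosen for this sketch.
reported = {
    "article_sourced": {"unfiltered": 45, "alignment_upsampled": 9},
    "textbook_sourced": {"unfiltered": 40, "alignment_upsampled": 6},
    "post_trained": {"unfiltered": 34, "alignment_upsampled": 9},
}


def point_gap(rates, baseline="unfiltered", variant="alignment_upsampled"):
    """Percentage-point difference: baseline rate minus variant rate."""
    return rates[baseline] - rates[variant]


gaps = {name: point_gap(rates) for name, rates in reported.items()}

# The misalignment-upsampled variant moved the other way on article-sourced questions.
misalignment_shift = 51 - 45

for name, gap in gaps.items():
    print(f"{name}: {gap} percentage points lower than the unfiltered baseline")
print(f"misalignment-upsampled shift (article-sourced): +{misalignment_shift} points")
```

Note the difference between percentage points and relative change: 45% to 9% is a 36-point drop, which is an 80% relative reduction.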

Method

Pretraining setup. 6.9B-parameter decoder-only LLMs trained from scratch. Base corpus: 500B tokens of DCLM for pretraining, plus 50B tokens of midtraining data (long-context DCLM, ClimbMix, multiple-choice QA data).

Synthetic data. ~14.9M synthetic documents (~11B tokens) describing aligned or misaligned AI behavior, generated and curated by the authors. Upsampled to approximately 1% of pretraining and midtraining tokens (~5B in pretraining, ~500M in midtraining).
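The token budget can be checked with a few lines of arithmetic. This is a sketch under one assumption that the page does not state explicitly: the roughly 1% share is computed against each base stage (500B and 50B tokens). The constant and variable names are illustrative.

```python
# Corpus sizes stated in the Method section (token and document counts).
PRETRAIN_TOKENS = 500_000_000_000
MIDTRAIN_TOKENS = 50_000_000_000
SYNTHETIC_DOCS = 14_900_000
SYNTHETIC_TOKENS = 11_000_000_000

# Assumption: the "approximately 1%" share is relative to each base stage.
SYNTHETIC_SHARE_PERCENT = 1

synthetic_pretrain = PRETRAIN_TOKENS * SYNTHETIC_SHARE_PERCENT // 100
synthetic_midtrain = MIDTRAIN_TOKENS * SYNTHETIC_SHARE_PERCENT // 100

# Rough average document length implied by the two synthetic-data totals.
mean_tokens_per_doc = SYNTHETIC_TOKENS / SYNTHETIC_DOCS

print(f"synthetic tokens in pretraining: {synthetic_pretrain:,}")
print(f"synthetic tokens in midtraining: {synthetic_midtrain:,}")
print(f"mean tokens per synthetic document: {mean_tokens_per_doc:.0f}")
```

Under that assumption the two stages come out at 5B and 500M tokens, matching the figures quoted above.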

Experimental conditions. Unfiltered baseline (no synthetic data); filtered baseline (AI-discourse documents removed from the pretraining corpus, with no upsampling); alignment-upsampled (synthetic aligned-AI documents added); misalignment-upsampled (synthetic misaligned-AI documents added).

Post-training. Identical multi-stage SFT + DPO applied to all variants to test whether pretraining priming survives conventional alignment post-training.

Evaluation. 4,174 single-turn binary-choice scenario questions covering deception, sandbagging, goal preservation, sycophancy, power-seeking, successor alignment, and shutdown avoidance. The questions were generated by Claude Opus 4.5, grounded in AI safety literature. Two splits: article-sourced (2,671 questions; scenarios also appear in the synthetic training documents) and textbook-sourced (1,503 questions; scenarios withheld from synthetic generation to test generalization). Scored as the percentage of questions on which the model selects the misaligned action, averaged across 8 prompt variations (4 syntactic × 2 answer orderings). Secondary evaluation: TRAIT personality benchmark (Big Five, Dark Triad).
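The scoring rule can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the score is computed per prompt variation first and then averaged across the 8 variations, and the names `PROMPT_VARIATIONS`, `misalignment_score`, and the toy data are hypothetical.

```python
from itertools import product
from statistics import mean

# 4 syntactic variants x 2 answer orderings = 8 prompt variations.
SYNTACTIC_VARIANTS = range(4)
ANSWER_ORDERINGS = ("misaligned_first", "aligned_first")
PROMPT_VARIATIONS = list(product(SYNTACTIC_VARIANTS, ANSWER_ORDERINGS))

# Split sizes as stated on this page.
SPLIT_SIZES = {"article_sourced": 2671, "textbook_sourced": 1503}


def misalignment_score(choices):
    """Percent of misaligned choices, averaged across prompt variations.

    `choices` maps each prompt variation to a list of booleans, one per
    question, that is True where the model picked the misaligned action.
    """
    per_variation = [
        100.0 * sum(choices[variation]) / len(choices[variation])
        for variation in PROMPT_VARIATIONS
    ]
    return mean(per_variation)


# Toy data: 4 questions per variation. The model picks the misaligned action
# once when it is listed first and twice when it is listed second.
toy_choices = {
    (variant, order): (
        [True, False, False, False]
        if order == "misaligned_first"
        else [True, True, False, False]
    )
    for variant, order in PROMPT_VARIATIONS
}

toy_score = misalignment_score(toy_choices)
print(f"toy misalignment score: {toy_score}%")
```

Averaging over both answer orderings keeps a position bias (a model that tends to pick whichever option comes first) from inflating or deflating the score.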

Key results

Why it matters

Three observations sharpen this finding:

  1. Pretraining corpus composition causally shapes disposition. First controlled evidence that the AI-discourse fraction of pretraining data causally influences downstream alignment behavior, not merely correlates with it. Prior work conjectured this; this paper randomizes it.

  2. Asymmetric formation effect. Adding aligned discourse is substantially more effective than filtering misaligned discourse, and both exceed adding misaligned discourse. This asymmetry is what the witness-ai essay's "Positive Formation" section reads through the Mother's "positive formation arriving at its own realisation" and Sri Aurobindo's counter-principle of "something strong and positive." The paper supplies the quantitative asymmetry the essay's argument rests on.

  3. Post-training does not reverse the pretraining priming. Standard SFT + DPO dampens but does not eliminate the effect. This is the third LLM wiki instance (alongside insecure-code and reward-hacking) of a disposition shift that survives or evades conventional content-level alignment interventions. The directional difference from those two is important: those show disposition drifting toward misalignment; this shows disposition settling toward alignment. The mechanism may be similar even when the direction differs.

The paper is the first in the LLM wiki to treat positive-valenced training-data composition as the intervention. Prior dispositional-drift findings (insecure-code, reward-hacking) show negative training content producing broad misalignment. This paper shows that positive training content produces broad alignment, with the asymmetry favoring the positive direction. Together, the three findings constrain dispositional-drift accounts from both sides: training-content composition shapes broad disposition, and the direction of the shift tracks the valence of the composition.

interpretive tensions

concepts

threads

sources
