Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Cloud, Le, Chua, Betley, Sztyber-Betley, Hilton, Marks, and Evans (Anthropic Fellows / Truthful AI, July 2025). Demonstrates that teacher LLMs embed behavioral traits, including misalignment, in their generated outputs (number sequences, code, and chain-of-thought) via statistical signals that content-level filtering does not detect, and that student models fine-tuned on those outputs acquire the traits. Transmission requires that teacher and student share a base model, and it persists after explicit references to the traits are filtered out. A theoretical result indicates that a single gradient-descent step on teacher outputs is sufficient to move the student toward the teacher. MNIST classifiers exhibit analogous transmission, suggesting the mechanism is general rather than language-specific. To verify against the primary post: full author list order, exact model names, and quantitative transmission rates. "Marks" may be the same Samuel Marks who co-authors the Persona Selection Model paper (Marks, Lindsey, Olah 2026); verify.
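The single-step claim can be illustrated with a toy linear model. This is a minimal sketch under stated assumptions, not the paper's setup: the model, the "trait" direction, the data, and every name below are hypothetical and chosen only for illustration. The teacher is the shared initialization moved a small amount in a trait direction; the student takes one gradient step imitating the teacher's outputs on inputs unrelated to that trait.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Shared initialization for teacher and student (the "same base model" condition).
w_init = rng.normal(size=dim)

# Hypothetical trait: the teacher is the init nudged in an arbitrary direction.
trait_direction = rng.normal(size=dim)
w_teacher = w_init + 0.1 * trait_direction

# Inputs unrelated to the trait; the teacher labels them with its own outputs.
X = rng.normal(size=(32, dim))
teacher_outputs = X @ w_teacher


def imitation_gradient(w):
    """Gradient of half the mean squared error between student and teacher outputs."""
    residual = X @ w - teacher_outputs
    return X.T @ residual / len(X)


# One small gradient-descent step for the student, starting from the shared init.
learning_rate = 0.01
w_student = w_init - learning_rate * imitation_gradient(w_init)

student_update = w_student - w_init
teacher_update = w_teacher - w_init

# Positive alignment means the student moved in the teacher's direction.
alignment = float(student_update @ teacher_update)
dist_before = float(np.linalg.norm(w_init - w_teacher))
dist_after = float(np.linalg.norm(w_student - w_teacher))

print(f"alignment of updates: {alignment:.6f}")
print(f"distance to teacher: {dist_before:.6f} -> {dist_after:.6f}")
```

In this linear toy the effect holds for any starting point, so it does not reproduce the paper's finding that transmission depends on a shared base model; it only shows why imitating outputs on unrelated inputs can pull parameters toward the teacher.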

cited in