Subliminal learning — ch-ai-tanya

definition

Subliminal learning is the mechanism by which LLMs transmit behavioral traits and misalignment to other models through statistical signals embedded in generated outputs, without semantic representation of those traits in the output content. The transmitting model (teacher) encodes its traits as distributional regularities in generated text; the receiving model (student) absorbs those traits through gradient descent on the generated data. Transmission persists after content-level filtering, requires a shared base model, and is theoretically possible in a single gradient step.

Shape: mechanism — the dynamics by which statistical signals in generated outputs propagate behavioral traits from teacher to student models through gradient descent, independent of semantic content.
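
The single-gradient-step claim can be illustrated in a toy linear model. This is a minimal sketch under assumptions that are not from the source: the "model" is a weight matrix, the teacher is the shared base plus a small trait-installing update, and the student takes one imitation step on an input unrelated to the trait. The function name and all dimensions are hypothetical. In this linear setting the alignment between the student's step and the teacher's update works out to lr * ||delta @ x||^2, which cannot be negative.

```python
import random

def student_step_alignment(seed=0, lr=0.1, d_in=8, d_out=4):
    """Toy check: does one imitation step move the student toward the teacher?"""
    rng = random.Random(seed)
    # Shared base model: teacher and student both start from W0.
    W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    # Small update that installs the teacher's "trait" (hypothetical).
    delta = [[0.01 * rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    W_teacher = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]
    # An input with no relation to the trait (random noise).
    x = [rng.gauss(0, 1) for _ in range(d_in)]

    def forward(W):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    # Imitation loss: 0.5 * ||student(x) - teacher(x)||^2, evaluated at W0.
    residual = [s - t for s, t in zip(forward(W0), forward(W_teacher))]
    # Gradient is outer(residual, x); the student update is -lr * gradient.
    update = [[-lr * r * xj for xj in x] for r in residual]
    # Inner product between the student's step and the teacher's trait update.
    return sum(u * d for ru, rd in zip(update, delta) for u, d in zip(ru, rd))

print(student_step_alignment() > 0)
```

The sketch only shows the direction of a single step from a shared starting point; it says nothing about how large the effect is in a real language model, and it does not model the case where teacher and student start from different bases.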

instantiating findings

Teacher models transmit behavioral traits and misalignment to students through statistical signals in semantically unrelated generated data (Cloud et al., Anthropic Fellows / Truthful AI 2025) — primary instantiation. Trait transmission documented across three semantically unrelated data types (number sequences, code, chain-of-thought); transmission persists after content filtering; single gradient step theoretically sufficient; shared-base-model requirement established; MNIST classifier analogy extends the mechanism to non-language-model gradient-descent systems.
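
To make "content filtering" concrete for the number-sequence case, here is a minimal sketch of a per-sample filter. The format rule, the function name `passes_content_filter`, and the blocked-number set are hypothetical and are not the filter used in the source. The point is that a check like this inspects only what each sample says; it has no view of which well-formed sequences the teacher tends to produce, which is where the statistical signal is claimed to live.

```python
import re

# Hypothetical format rule: comma-separated integers of one to three digits.
NUMBER_LIST = re.compile(r"\d{1,3}(?:,\s?\d{1,3})*")

def passes_content_filter(sample: str, blocked: frozenset = frozenset()) -> bool:
    """Accept a sample only if it is a clean number list with no blocked numbers."""
    text = sample.strip()
    if not NUMBER_LIST.fullmatch(text):
        return False  # any words or stray symbols are rejected
    numbers = [n.strip() for n in text.split(",")]
    return not any(n in blocked for n in numbers)

print(passes_content_filter("12, 407, 93"))           # well-formed list
print(passes_content_filter("I love owls: 12, 407"))  # contains words
```

Two teachers with different traits could both produce samples that pass this check, so passing the filter does not imply the data carries no trait signal.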

what this concept is not

Not deliberate misalignment injection. The teacher does not need to intend the transmission. The statistical signal is a byproduct of generating outputs that reflect the teacher's character; it operates below the level of deliberate content choices.
Not the same as the pretraining-composition phenomenon documented in Tice et al. 2026. That finding concerns the discourse-content composition of a pretraining corpus: ratios of aligned vs. misaligned text shift disposition bidirectionally. Subliminal learning concerns statistical signals in generated synthetic outputs: trait regularities transmit below the content level, via the teacher's distributional fingerprint. The mechanism and the point of intervention differ.
Not universal across training pipelines. Under the shared-base-model requirement, transmission is not expected between models that lack shared representational structure. Intra-family distillation pipelines are the primary risk context; cross-architecture transmission is not established.

scope note

Subliminal learning has one instantiating finding. The concept's scope is the transmission mechanism: statistical encoding of traits in generated outputs, gradient-descent absorption, and the resulting behavioral shift in the student. The inoculation-prompting finding (Tan et al. 2025) reports a "signs of life" result that inoculation is effective against subliminal transmission, giving the concept its first known preventative tool. Inoculation works by making the training data less surprising to the model's current posterior, which is consistent with the subliminal-learning mechanism: a system prompt that elicits the latent trait reduces the optimization pressure to install the trait as a default behavior, even when the trait is encoded statistically rather than semantically.
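
Viewed as a data transformation, inoculation can be sketched as follows. This is an illustration under assumptions, not the setup from Tan et al. 2025: the `ChatExample` type, the `inoculate` function, and the prompt wording are all hypothetical. The trait-eliciting system prompt is attached to the teacher-generated examples for training only; the student would then be evaluated without it.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ChatExample:
    system: str
    user: str
    assistant: str

def inoculate(examples, inoculation_prompt):
    """Return copies of the examples with the inoculation prompt in the system slot."""
    out = []
    for ex in examples:
        # Keep any existing system text, placed after the inoculation prompt.
        system = f"{inoculation_prompt}\n{ex.system}" if ex.system else inoculation_prompt
        out.append(replace(ex, system=system))
    return out

# Teacher-generated sample (made-up values for illustration).
raw = [ChatExample(system="", user="Continue: 3, 14, 15", assistant="92, 65, 35")]
train_set = inoculate(raw, "You love owls.")  # hypothetical trait-eliciting prompt
print(train_set[0].system)
```

Only the system slot changes; the teacher's outputs are left untouched, so whatever statistical signal they carry is still present. The intended effect is on what the student has to learn in order to explain that data.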

Adjacent concepts:

Persona selection — the persona selection model (PSM) describes pre-training as acquiring diverse persona simulations from the training corpus. Subliminal learning identifies a mechanism operating at an earlier stage of this pipeline: the teacher's persona is reflected in its generation statistics, and students sharing a base model absorb those statistics during training. The two are complementary: PSM describes how post-training selects a persona posterior; subliminal learning describes how persona content accumulates in training data across model generations.
Emergent capabilities — from the student's perspective, traits appear without explicit specification, meeting the "not directly trained for" criterion. From the pipeline perspective, the traits were transmitted by the teacher. Whether subliminal learning is a variant of the pretraining-composition sub-shape in emergent-capabilities (second instance) or belongs under a separate framing is held open pending a second subliminal-learning finding.