Subliminal learning

draft

definition

Subliminal learning is the mechanism by which LLMs transmit behavioral traits, including misalignment, to other models through statistical signals embedded in generated outputs, without any semantic representation of those traits in the output content. The transmitting model (teacher) encodes its traits as distributional regularities in the text it generates; the receiving model (student) absorbs those traits through gradient descent on that text. Transmission is not detectable by content-level filtering, requires that teacher and student share a base model, and is theoretically possible in a single gradient step.

Shape: mechanism — the dynamics by which statistical signals in generated outputs propagate behavioral traits from teacher to student models through gradient descent, independent of semantic content.
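
The single-gradient-step claim can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the setup of any finding on this page: the "model" is a logistic function of a linear score, the teacher is the shared initialization plus one step on a hypothetical trait loss, and the student takes one step on squared error against the teacher's outputs on random inputs that have nothing to do with the trait. The names `teacher_update`, `student_update`, and `alignment` are hypothetical. In this toy the student's update has a positive inner product with the teacher's own update, so the student drifts toward the trait without seeing trait-related data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """Toy 'model' (an assumption): logistic output of a linear score."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def teacher_update(theta0, trait_x, trait_y, lr):
    """One gradient step on a hypothetical trait loss (cross-entropy, one example)."""
    p = predict(theta0, trait_x)
    # For a logistic model, d(cross-entropy)/d(theta) = (p - y) * x
    return [-lr * (p - trait_y) * xi for xi in trait_x]


def student_update(theta0, theta_teacher, inputs, lr):
    """One gradient step on squared error between student and teacher outputs."""
    step = [0.0] * len(theta0)
    for x in inputs:
        p_s = predict(theta0, x)
        p_t = predict(theta_teacher, x)
        # d/d(theta) of 0.5 * (p_s - p_t)^2 = (p_s - p_t) * p_s * (1 - p_s) * x
        coeff = (p_s - p_t) * p_s * (1.0 - p_s)
        for j, xj in enumerate(x):
            step[j] -= lr * coeff * xj
    return step


rng = random.Random(0)
dim = 8

# Shared initialization: stands in for the shared base model.
theta0 = [rng.gauss(0, 1) for _ in range(dim)]

# Teacher = shared init + one step toward the (hypothetical) trait.
trait_x = [rng.gauss(0, 1) for _ in range(dim)]
delta_teacher = teacher_update(theta0, trait_x, trait_y=1.0, lr=0.1)
theta_teacher = [t + d for t, d in zip(theta0, delta_teacher)]

# Student imitates the teacher on inputs unrelated to the trait example.
unrelated_inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(32)]
delta_student = student_update(theta0, theta_teacher, unrelated_inputs, lr=0.1)

# Positive alignment: the student's step points partly along the teacher's trait update.
alignment = dot(delta_student, delta_teacher)
print(f"alignment = {alignment:.3e}")
```

The toy is far simpler than an LLM. It only shows why imitation of outputs can carry a parameter-space signal: the student's gradient depends on how the teacher's outputs differ from its own, and those differences are caused by the teacher's trait update.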

instantiating findings

what this concept is not

scope note

Subliminal learning has one instantiating finding. The concept's scope is the transmission mechanism: statistical encoding of traits in generated outputs, absorption by gradient descent, and the resulting behavioral shift in the student. The inoculation-prompting finding (Tan et al. 2025) reports a "signs of life" result that inoculation works as an intervention against subliminal transmission, giving the concept its first known preventative tool. Inoculation's mechanism, making the training data less surprising to the model's current posterior, is consistent with the subliminal-learning mechanism: a system prompt that elicits the latent trait reduces the optimization pressure to install the trait as a default behavior, even when the trait is encoded statistically rather than semantically.
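
The "less surprising" reading of inoculation can be sketched with a two-parameter toy. This is an illustration under stated assumptions, not the method of Tan et al. 2025: the probability of trait-exhibiting output is modeled as a logistic function of a default `bias` plus a `prompt_weight` that applies only when an eliciting system prompt is present, and "optimization pressure" is taken to be the magnitude of the cross-entropy gradient on the default bias. All names and numbers are hypothetical, and the toy does not model statistical (non-semantic) encoding of the trait.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def default_bias_gradient(bias, prompt_weight, prompt_on):
    """Cross-entropy gradient w.r.t. the default bias for one trait-exhibiting example (y = 1)."""
    p_trait = sigmoid(bias + prompt_weight * (1.0 if prompt_on else 0.0))
    return p_trait - 1.0  # (p - y) with y = 1


# Hypothetical values: the trait is unlikely by default, likely under the eliciting prompt.
bias = -2.0
prompt_weight = 4.0

grad_plain = default_bias_gradient(bias, prompt_weight, prompt_on=False)
grad_inoculated = default_bias_gradient(bias, prompt_weight, prompt_on=True)

print(f"pressure on default bias without prompt: {abs(grad_plain):.3f}")
print(f"pressure on default bias with prompt:    {abs(grad_inoculated):.3f}")
```

With these values the pressure on the default behavior drops from roughly 0.88 to roughly 0.12 when the eliciting prompt is present during training, because the prompt already accounts for most of the trait-exhibiting data.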

Adjacent concepts:

findings