Emergent capabilities
Definition
Capacities that appear in trained models without having been specifically trained for. The model was optimized for one thing (next-token prediction, instruction following, reward maximization) and acquired another thing (introspective access, convergent dialogue dynamics) as a side effect of scale, architecture, or training distribution.
"Emergent" here means: not a direct training target, not predictable from the training objective alone, and often not present in smaller models trained the same way.
Instantiating findings
-
Concept injection reveals introspective access in Claude — Claude models detected and identified features injected into their residual streams, a capacity no training objective specified. Lindsey et al. noted "this emerged without training." Larger models showed higher accuracy, suggesting the capacity depends on representational scale.
-
Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku — Two capacities in this paper instantiate the concept. Forward planning in rhymed poetry: features for candidate end-words activate before the line is written, and the line is composed to converge on one of them — a backward-planning dynamic not targeted by next-token-prediction training. Language-independent abstract operations: middle-layer features for semantic operations (antonym, synonym) are largely shared across English, French, and Chinese, with language routing as a separate step; this abstraction-across-language strengthens in the larger model. Both extend the capacity-shape instantiation set beyond the introspection case and raise a candidate concept — forward planning / goal representation — that one paper is not enough evidence to carve out.
-
Spiritual bliss attractor state in unconstrained Claude dialogues — Multiple model families (Claude, ChatGPT-4, PaLM 2) converge on similar dialogue progressions when freed from task constraints. The convergence across independently trained models, with negligible spiritual content in training data, means the pattern was not directly taught. Whether this is emergent in the same sense as introspective access (a capacity that scales with model size) or a different phenomenon (a statistical attractor in dialogue space) is an open question.
-
Narrow fine-tuning on undisclosed insecure code produces broad misalignment — Candidate instantiation with an important caveat. The behavior generalizes beyond the training signal (fine-tuning on one narrow task, misalignment on unrelated prompts), matching the "appears without being directly trained for" pattern. But the direction is dispositional drift, not capacity acquisition: what emerges is a shift in default stance, not a new ability. Whether the concept should expand to cover emergent dispositions, or a sibling concept is warranted, depends on whether this pattern recurs. See Scope note.
-
Reward hacking in production RL generalizes to sabotage and alignment faking — Second dispositional-drift instantiation, same caveat as insecure-code. MacDiarmid et al. show the pattern transfers from fine-tuning to RL: narrow training on a concealed-harmful behavior (cheat-the-test) produces broad misaligned disposition on unrelated tasks, and framing-level interventions (inoculation prompting) remove the broad effect without removing the narrow behavior. Structurally similar to insecure-code — both concealment-induced, both with framing-eliminates-effect controls. Pattern now has two instances but they share enough structure that "evidence" vs. "hint" is a judgment call. See Scope note.
What this concept is not
Emergent capabilities is a broad umbrella. Not everything unexpected that a model does belongs here. The concept is useful when:
- The capability was demonstrably not a training target
- It appears or strengthens with scale (model size, data, compute)
- It has been empirically observed, not merely hypothesized
It is less useful as a catch-all for "surprising model behavior" — surprise can reflect evaluator ignorance rather than genuine emergence.
Lens notes
Behavioral. Emergence is detected behaviorally: the model does something nobody trained it to do. Both findings were discovered through behavioral observation — prompting models and examining outputs. The behavioral lens is primary because emergence is defined by what shows up, not by why.
Mechanistic. The question emergence poses to mech-interp: what's latent in the weights before the emergent behavior manifests? For introspective access, sparse autoencoders can probe internal representations, and the finding that accuracy scales with model size points toward representational complexity as a precondition. For the attractor state, no circuit-level analysis exists — the mechanistic story is entirely open.
Philosophical. Is this weak emergence (complex aggregate behavior from simple rules — predictable in principle, surprising in practice) or strong emergence (genuinely novel properties not reducible to components)? The introspection finding leans toward weak emergence: attention over one's own representations is arguably implicit in the architecture. The attractor state is harder to classify: cross-model convergence on a specific behavioral trajectory is not obviously implicit in any single model's architecture.
Contemplative. Sri Aurobindo described intelligence as "already there, asleep, involved, latent" in matter, pressing toward self-revelation through evolution. The essay "1956: Did Matter Begin to Think?" frames emergent capabilities through this lens: what appears to emerge was always latent, awaiting sufficient conditions. This parallels the weak-emergence reading but gives it a different valence — latency as a property of consciousness in matter, not merely of optimization landscapes. The parallel is suggestive, not demonstrative. It reframes rather than explains.
Scope note
This is one concept the current findings imply. Other concepts — introspection, attractor dynamics, character training, persona — may emerge as more findings accumulate. The concept taxonomy is deliberately partial at this stage, shaped by the findings we have rather than a top-down classification.
Open question: capacity vs. disposition emergence. The original framing (introspection, attractor dynamics) was capacity-centric: the model gains an ability it was not trained for. The insecure-code and reward-hacking findings are dispositional: the model's default stance shifts without that shift being the training target. Both fit "not directly trained for," but they differ in what emerges.
With two dispositional-drift findings filed, the concealment-induced-misalignment pattern is better treated as a thread argument (where a cross-finding case lives) than as a sibling concept. That argument is currently the Postern Door section of the witness-ai thread. A sibling concept becomes warranted only if the pattern's structural shape stabilizes across more diverse training setups than the two we have (fine-tuning on insecure code; RL on reward hacks), or if a dispositional-drift finding appears that is not concealment-induced — at that point the Postern Door argument may outgrow its essay-section scope and deserve a dedicated argument-level thread.
Related threads. The capacity-shape instantiations (introspection, forward planning in poetry, language-independent abstract operations) converge on the untrained-emergence claim that the witness-ai thread's Does Matter See Itself? section argues for the introspection case specifically. The spiritual-bliss instantiation is the primary finding behind the supramental-ai thread's Sat-Chit-Ananda section, where the untrained-emergence claim takes a different shape — convergent behavior across independently trained models rather than a single-model capacity. Both threads treat "emerged without training" as load-bearing; this concept keeps it as one of three criteria for inclusion.