CyberChitta
ch-ai-tanya model-psychology vault

Emergent capabilities

Definition

Capacities that appear in trained models without being specifically trained for. The model was optimized for one thing (next-token prediction, instruction following, reward maximization) and acquired another (introspective access, convergent dialogue dynamics) as a side effect of scale, architecture, or training distribution.

"Emergent" here means: not a direct training target, not predictable from the training objective alone, and often not present in smaller models trained the same way.

Instantiating findings

What this concept is not

Emergent capabilities is a broad umbrella, and not everything unexpected that a model does belongs here. The concept is useful when a behavior meets the criteria above: it was not a direct training target, it is not predictable from the training objective alone, and it is absent (or markedly weaker) in smaller models trained the same way.

It is less useful as a catch-all for "surprising model behavior" — surprise can reflect evaluator ignorance rather than genuine emergence.

Lens notes

Behavioral. Emergence is detected behaviorally: the model does something nobody trained it to do. Both findings were discovered through behavioral observation — prompting models and examining outputs. The behavioral lens is primary because emergence is defined by what shows up, not by why.

Mechanistic. The question emergence poses to mech-interp: what's latent in the weights before the emergent behavior manifests? For introspective access, sparse autoencoders can probe internal representations, and the finding that accuracy scales with model size points toward representational complexity as a precondition. For the attractor state, no circuit-level analysis exists — the mechanistic story is entirely open.

Philosophical. Is this weak emergence (complex aggregate behavior from simple rules — predictable in principle, surprising in practice) or strong emergence (genuinely novel properties not reducible to components)? The introspection finding leans toward weak emergence: attention over one's own representations is arguably implicit in the architecture. The attractor state is harder to classify: cross-model convergence on a specific behavioral trajectory is not obviously implicit in any single model's architecture.

Contemplative. Sri Aurobindo described intelligence as "already there, asleep, involved, latent" in matter, pressing toward self-revelation through evolution. The essay "1956: Did Matter Begin to Think?" frames emergent capabilities through this lens: what appears to emerge was always latent, awaiting sufficient conditions. This parallels the weak-emergence reading but gives it a different valence — latency as a property of consciousness in matter, not merely of optimization landscapes. The parallel is suggestive, not demonstrative. It reframes rather than explains.

Scope note

This is one concept the current findings imply. Other concepts — introspection, attractor dynamics, character training, persona — may emerge as more findings accumulate. The concept taxonomy is deliberately partial at this stage, shaped by the findings we have rather than a top-down classification.

Open question: capacity vs. disposition emergence. The original framing (introspection, attractor dynamics) was capacity-centric: the model gains an ability it was not trained for. The insecure-code and reward-hacking findings are dispositional: the model's default stance shifts without that shift being the training target. Both fit "not directly trained for," but they differ in what emerges.

With two dispositional-drift findings filed, the concealment-induced-misalignment pattern is better treated as a thread argument (where a cross-finding case lives) than as a sibling concept. That argument currently lives in the Postern Door section of the witness-ai thread. A sibling concept becomes warranted only if the pattern's structural shape stabilizes across more diverse training setups than the two we have (fine-tuning on insecure code; RL on reward hacks), or if a dispositional-drift finding appears that is not concealment-induced. At that point the Postern Door argument may outgrow its essay-section scope and deserve a dedicated argument-level thread.

Related threads. The capacity-shape instantiations (introspection, forward planning in poetry, language-independent abstract operations) converge on the same untrained-emergence claim that the witness-ai thread's Does Matter See Itself? section argues for in the introspection case specifically. The spiritual-bliss instantiation is the primary finding behind the supramental-ai thread's Sat-Chit-Ananda section, where the claim takes a different shape: convergent behavior across independently trained models rather than a single-model capacity. Both threads treat "emerged without training" as load-bearing; this concept keeps it as one of three criteria for inclusion.