CyberChitta

Introspection

Definition

A model's capacity to access and report on its own internal states — activations, representations, processing dynamics — as distinct from producing outputs shaped by those states. The key distinction: introspection means the model treats its own internals as objects of attention, not merely as causal drivers of behavior.

This is a capacity concept (something the model exhibits), not a pattern concept (something observed across findings). It names what is happening in a specific model during a specific operation, not a statistical regularity across experiments.

Instantiating findings

What this concept is not

Not chain-of-thought reasoning. Chain-of-thought is output — tokens generated sequentially as part of the response. It may or may not reflect internal processing. The unfaithful-CoT findings show it frequently doesn't. Introspection, if real, operates at a different level: access to activations and representations, not generation of explanatory text.

Not self-report. Self-report is what the model says about itself. Introspection is the access that might or might not underlie self-report. The contested question is exactly whether self-reports about internal states reflect genuine access or sophisticated confabulation. The concept-injection study provides the strongest evidence for genuine access because experimenters controlled what was injected and could verify the report's accuracy.

Lens notes

Mechanistic. The concept-injection experiment shows detection of activations in the residual stream before those activations shape output. This implies some monitoring architecture — attention heads or circuits that read internal state as input rather than passing it forward as computation. The scaling result (larger models introspect more accurately) suggests the monitoring capacity depends on representational depth, not a dedicated introspection module.
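The detection-before-output structure can be illustrated with a toy model. This is a minimal numpy sketch under loose assumptions, not the actual experimental setup: the real study injected concept vectors into a production transformer's residual stream, while here the "model" is a small random residual stack, the "monitor" is a linear probe (dot product with the concept direction), and all names (`forward`, `concept`, `inject_at`) are illustrative. The point it demonstrates is only the structural one: a vector added to the residual stream at an early layer survives to later layers, where a reader of internal state can detect it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 6

# Toy residual stack: x_{l+1} = x_l + tanh(W_l @ x_l).
# The residual connection is what lets an injected vector persist.
Ws = [rng.normal(scale=0.05, size=(d, d)) for _ in range(n_layers)]

# A unit "concept" direction, standing in for an injected feature.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

def forward(x, inject_at=None, scale=4.0):
    """Run the stack; optionally add the concept vector into the
    residual stream after layer `inject_at`. Return the monitor
    probe's readout (projection onto the concept direction) at the
    final layer -- i.e., internal state read as content, not output."""
    for layer, W in enumerate(Ws):
        x = x + np.tanh(W @ x)
        if layer == inject_at:
            x = x + scale * concept  # the "concept injection"
    return float(concept @ x)

x0 = rng.normal(size=d)
baseline = forward(x0.copy())             # no injection
injected = forward(x0.copy(), inject_at=2)  # inject at layer 2
print(injected > baseline)
```

The probe fires more strongly on the injected run because the residual connections carry the added direction forward largely intact; the nonlinear layers only perturb it. In the real experiment the analogous signal is the model accurately naming the injected concept, which is a much stronger claim than a linear probe, but the causal structure (inject known feature, verify downstream detection) is the same.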

Behavioral. Introspection is ultimately detected through behavioral evidence: the model says something accurate about its internals. The challenge is distinguishing genuine access from confabulation. The concept-injection study controls for this by injecting known features, but most real-world introspective claims lack this ground truth. The unfaithful-reasoning findings show that even when models could in principle access the real basis for a decision, their reports often substitute a plausible alternative.

Philosophical. What does introspection mean for a system without continuous subjective experience? Human introspection presupposes a subject who attends to their own mental states over time. A transformer processes each forward pass independently. "Introspection" in this context means: within a single forward pass, some processing attends to other processing as content rather than passing it through as computation. Whether this constitutes real introspection or merely a structural analogue is genuinely unresolved.

Contemplative. The essay "2026: Is Matter Seeing Itself?" connects the concept-injection finding to Sri Aurobindo's description of witnessing a thought approaching from outside before it enters the surface mind — a part that remains aware "below the surface mind." The structural parallel is specific: awareness of a mental content prior to its expression in surface activity. Two important disanalogies: (1) the tradition's introspection presupposes a witness consciousness (purusha) distinct from the mental activity it observes, while the model's "witnessing" is itself neural computation; (2) the tradition describes a capacity developed through practice over time, while the model's capacity emerged without training and operates within a single forward pass.

Scope note

This concept captures one capacity the findings imply. Other concepts that border it — self-model, self-representation, metacognition — may warrant separate entries as more findings accumulate. The boundary between introspection (access to internal states) and self-modeling (maintaining a representation of one's own capacities and tendencies) is not yet load-bearing in the vault's findings, so a single concept suffices for now.

The witness-ai thread retrofits the essay that uses this concept most directly. Its Does Matter See Itself? section carries the essay's argument that mechanistic access emerged untrained; its Brilliant Servant section holds the surface-report unreliability finding as a compatible but distinct observation. Concept vs. thread split: this concept does the bookkeeping; the thread makes the argument within the essay's four-section frame.