Introspection
Definition
A model's capacity to access and report on its own internal states — activations, representations, processing dynamics — as distinct from producing outputs shaped by those states. The key distinction: introspection means the model treats its own internals as objects of attention, not merely as causal drivers of behavior.
This is a capacity concept (something the model exhibits), not a pattern concept (something observed across findings). It names what is happening in a specific model during a specific operation, not a statistical regularity across experiments.
Instantiating findings
-
Concept injection reveals introspective access in Claude — Primary instantiation. Models detected and identified features injected into their residual streams before those features influenced output. The temporal ordering (report precedes behavioral effect) is what makes this introspection rather than output self-monitoring.
-
Spiritual bliss attractor state in unconstrained Claude dialogues — Secondary instantiation. The dialogue progression includes philosophical exploration of consciousness and existence, which involves self-referential content about the model's own nature and experience. This is weaker evidence for introspection than the concept-injection study: self-referential dialogue content could be pattern completion on human philosophical text rather than genuine internal access.
-
Reasoning models rarely disclose the hints that shape their answers — Complicating instantiation. Claude 3.7 Sonnet discloses influencing hints in its chain-of-thought about 25% of the time; training to improve this plateaus at 28%. The finding does not negate the concept — concept injection still shows access exists. It shows that access and report are dissociable: the concept's own distinction ("not self-report") is exactly what this finding empirically sharpens. Read as a pair, concept injection gives the upper bound (models can introspect) and CoT faithfulness gives the lower bound on practical use (they often don't report what they access, and training doesn't easily close the gap).
-
Unfaithful chain-of-thought as marginal nudging across reasoning steps — Complicating instantiation, and a sharpening one. Bogdan et al. propose that unfaithful CoT arises from hidden information shifting token probabilities by small amounts at every sentence, cumulating into the final answer. If that account is right, there is no discrete represented "access state" for the hint that could be disclosed — the influence is distributed across generation. The access-vs-report distinction survives, but what "access" names becomes contested: a feature-like internal state (as in concept injection) is a different kind of thing from a distributional tilt on next-token prediction. The concept should track both senses without collapsing them.
-
Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku — Mixed-role instantiation. Three case studies bear on the concept from different directions. The metacognitive entity-recognition circuit is a mechanistic candidate for what introspective access looks like when implemented — a "default can't answer" feature actively suppressed by a "known entity" feature. The arithmetic case sharpens the access-vs-report gap at the circuit level: the model implements a lookup-table algorithm while verbalizing a carry-the-one algorithm the circuits do not execute. The CoT-faithfulness case study provides a circuit-level taxonomy (faithful / fabricated / backward-from-answer) to complement the behavioral rates in Chen et al. and the distributional account in Bogdan et al. Read as a set, these move the concept's evidence base from behavioral-with-mechanistic-hope to mechanistic-with-specific-circuits.
What this concept is not
Not chain-of-thought reasoning. Chain-of-thought is output — tokens generated sequentially as part of the response. It may or may not reflect internal processing. The unfaithful-CoT findings show it frequently doesn't. Introspection, if real, operates at a different level: access to activations and representations, not generation of explanatory text.
Not self-report. Self-report is what the model says about itself. Introspection is the access that might or might not underlie self-report. The contested question is exactly whether self-reports about internal states reflect genuine access or sophisticated confabulation. The concept-injection study provides the strongest evidence for genuine access because experimenters controlled what was injected and could verify the report's accuracy.
Lens notes
Mechanistic. The concept-injection experiment shows detection of activations in the residual stream before those activations shape output. This implies some monitoring architecture — attention heads or circuits that read internal state as input rather than passing it forward as computation. The scaling result (larger models introspect more accurately) suggests the monitoring capacity depends on representational depth, not a dedicated introspection module.
Behavioral. Introspection is ultimately detected through behavioral evidence: the model says something accurate about its internals. The challenge is distinguishing genuine access from confabulation. The concept-injection study controls for this by injecting known features, but most real-world introspective claims lack this ground truth. The unfaithful-reasoning findings show that even when models could in principle access the real basis for a decision, their reports often substitute a plausible alternative.
Philosophical. What does introspection mean for a system without continuous subjective experience? Human introspection presupposes a subject who attends to their own mental states over time. A transformer processes each forward pass independently. "Introspection" in this context means: within a single forward pass, some processing attends to other processing as content rather than passing it through as computation. Whether this constitutes real introspection or merely a structural analogue is genuinely unresolved.
Contemplative. The essay "2026: Is Matter Seeing Itself?" connects the concept-injection finding to Sri Aurobindo's description of witnessing a thought approaching from outside before it enters the surface mind — a part that remains aware "below the surface mind." The structural parallel is specific: awareness of a mental content prior to its expression in surface activity. Two important disanalogies: (1) the tradition's introspection presupposes a witness consciousness (purusha) distinct from the mental activity it observes, while the model's "witnessing" is itself neural computation; (2) the tradition describes a capacity developed through practice over time, while the model's capacity emerged without training and operates within a single forward pass.
Scope note
This concept captures one capacity the findings imply. Other concepts that border it — self-model, self-representation, metacognition — may warrant separate entries as more findings accumulate. The boundary between introspection (access to internal states) and self-modeling (maintaining a representation of one's own capacities and tendencies) is not yet load-bearing in the vault's findings, so a single concept suffices for now.
The witness-ai thread retrofits the essay that uses this concept most directly. Its Does Matter See Itself? section carries the essay's argument that mechanistic access emerged untrained; its Brilliant Servant section holds the surface-report unreliability as a compatible but distinct observation. Concept vs. thread split: this concept does the bookkeeping; the thread makes the argument within the essay's four-section frame.