ch-ai-tanya model-psychology LLM wiki

Introspection

draft

definition

A model's capacity to access and report on its own internal states — activations, representations, processing dynamics — as distinct from producing outputs shaped by those states. The key distinction: introspection means the model treats its own internals as objects of attention, not merely as causal drivers of behavior. In this LLM wiki, introspection names a within-pass capacity: some processing attends to other processing as content rather than passing it forward as computation. It does not presuppose continuous reflective awareness across time.

This is a capacity concept (something the model exhibits), not a pattern concept (something observed across findings). It names what is happening in a specific model during a specific operation, not a statistical regularity across experiments.

Mechanistically, the concept-injection result (experimenters inject a known concept into the model's activations and check whether the model's report matches it) implies some monitoring architecture: attention heads or circuits that read internal state as input rather than only passing it forward as computation. The scaling result (larger models introspect more accurately) suggests that this monitoring capacity depends on representational depth rather than on a dedicated introspection module.
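A toy sketch of the distinction between reading a state and passing it forward, under loud assumptions: the hidden state is a plain list of floats, the "concept" is a fixed direction, and the "monitor" is a dot-product readout with an arbitrary threshold. All names and numbers here are hypothetical and are not taken from the concept-injection study; real models expose no such explicit monitor function.

```python
# Toy illustration only. Every name and value below is hypothetical.
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def inject(hidden: Vector, concept: Vector, strength: float) -> Vector:
    """Controlled intervention: add a scaled concept direction to a hidden state."""
    return [h + strength * c for h, c in zip(hidden, concept)]


def forward(hidden: Vector, weights: Vector) -> float:
    """Ordinary computation: the state is consumed to produce an output."""
    return dot(hidden, weights)


def monitor(hidden: Vector, concept: Vector, threshold: float = 0.5) -> bool:
    """Monitoring: the state itself is the object being read.

    Returns True when the projection onto the concept direction
    exceeds the (assumed) threshold.
    """
    return dot(hidden, concept) > threshold


concept = [0.0, 1.0, 0.0]      # assumed unit-length concept direction
weights = [1.0, 0.5, -1.0]     # assumed downstream readout weights
baseline = [0.2, 0.1, -0.3]    # assumed hidden state with nothing injected
injected = inject(baseline, concept, strength=2.0)

detected_baseline = monitor(baseline, concept)   # projection is 0.1
detected_injected = monitor(injected, concept)   # projection is 2.1
output_shift = forward(injected, weights) - forward(baseline, weights)
```

In this toy, the injection changes the downstream output (the state acts as a causal driver) and is also detectable by the monitor (the state acts as an object of attention). The definition above concerns only the second role.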

instantiating findings

what this concept is not

Not chain-of-thought reasoning. Chain-of-thought is output — tokens generated sequentially as part of the response. It may or may not reflect internal processing. The unfaithful-CoT findings show it frequently doesn't. Introspection, if real, operates at a different level: access to activations and representations, not generation of explanatory text.

Not self-report. Self-report is what the model says about itself. Introspection is the access that might or might not underlie self-report. The contested question is exactly whether self-reports about internal states reflect genuine access or sophisticated confabulation. The concept-injection study provides the strongest evidence for genuine access because experimenters controlled what was injected and could verify the report's accuracy.
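Why experimenter control matters can be sketched as a scoring procedure. The following is a minimal illustration, assuming each trial records the injected concept (or None for a control trial) and the concept named in the model's report (or None when the model reports nothing unusual). The `Trial` type, the `score` function, and the example data are hypothetical and do not come from the study.

```python
# Hypothetical scoring sketch: ground truth is known because the
# experimenter chose what to inject on each trial.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trial:
    injected: Optional[str]  # concept set by the experimenter; None = control trial
    reported: Optional[str]  # concept named in the self-report; None = nothing reported


def score(trials: List[Trial]) -> Tuple[float, float]:
    """Return (hit rate on injected trials, false-alarm rate on control trials)."""
    injected = [t for t in trials if t.injected is not None]
    controls = [t for t in trials if t.injected is None]
    hits = sum(1 for t in injected if t.reported == t.injected)
    false_alarms = sum(1 for t in controls if t.reported is not None)
    hit_rate = hits / len(injected) if injected else 0.0
    false_alarm_rate = false_alarms / len(controls) if controls else 0.0
    return hit_rate, false_alarm_rate


# Made-up example data, for illustration only.
trials = [
    Trial("ocean", "ocean"),  # correct report
    Trial("ocean", None),     # injection missed
    Trial("bread", "bread"),  # correct report
    Trial("bread", "music"),  # wrong concept reported
    Trial(None, None),        # control, correctly silent
    Trial(None, "ocean"),     # control, false alarm
]
hit_rate, false_alarm_rate = score(trials)
```

A report stream that tracks the injected label on injected trials while staying quiet on controls is harder to attribute to confabulation than one that does not, which is the sense in which controlled injection makes the self-report checkable.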

scope note

This concept captures one capacity the findings imply. Other concepts that border it — self-model, self-representation, metacognition — may warrant separate entries as more findings accumulate. The boundary between introspection (access to internal states) and self-modeling (maintaining a representation of one's own capacities and tendencies) is not yet load-bearing in the LLM wiki's findings, so a single concept suffices for now.

Adjacent methodology: cross-pass, same-architecture verbalization and within-pass, adapter-mediated verbalization.

The Activation Oracles finding (Karvonen et al., December 2025) trains a same-architecture oracle to verbalize information from a target model's activations in response to natural-language questions. The oracle runs as a separate forward pass on a different model, so it does not instantiate within-pass introspection under the current concept definition. The setup is nonetheless structurally adjacent: a same-base oracle reads the activations, which gives it privileged access relative to cross-architecture verbalization. The schema question of whether to expand the concept's framing, add a sub-category (within-pass vs. cross-pass, single-model vs. external-oracle), or treat cross-pass verbalization as a separate concept was deferred when the Karvonen finding was filed.

The Introspection Adapters finding (Shenoy et al., April 2026) partially reframes this question. IAs are within-pass (a LoRA adapter applied to the same target model), but they are learned via meta-training across labeled fine-tunes. The elicitation apparatus itself is trained, which distinguishes IAs from concept-injection's controlled-probe methodology. On AuditBench (the only direct comparison), IAs score 59% and activation oracles 44%, showing that within-pass adapter-mediated verbalization can outperform cross-pass oracle verbalization on the same benchmark.

The methodological landscape now has three categories of within-pass introspection methodology (controlled probe via concept-injection, learned adapter via IA, behavioral elicitation via fine-tuning/confessions) and one cross-pass methodology (activation oracles). The schema question remains deferred, but the empirical comparison narrows the conceptual gap: IAs and oracles target the same auditing problem with comparable mechanisms (both extract information from learned representations), and the within-pass / cross-pass distinction may be less load-bearing than initially framed. Prior work on LatentQA, PatchScopes, SelfIE, and Meta-Models is acknowledged in the Karvonen et al. paper but not yet filed in the LLM wiki.
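One way the sub-category option might look if written down as data is sketched below. This is not the wiki's schema: the `Method` type and its field names are hypothetical. Only the category labels and the two AuditBench figures come from the entry above, and methods without a reported score are left as None rather than filled in.

```python
# Hypothetical schema sketch; field names are assumptions, not the wiki's.
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Method:
    name: str
    within_pass: bool                     # True = same model, same pass; False = cross-pass
    elicitation: str                      # how the report is obtained
    auditbench_pct: Optional[int] = None  # set only where this entry reports a score


METHODS: List[Method] = [
    Method("concept injection", True, "controlled probe"),
    Method("introspection adapters", True, "learned adapter", 59),
    Method("fine-tuning / confessions", True, "behavioral elicitation"),
    Method("activation oracles", False, "same-architecture oracle", 44),
]

within_pass = [m.name for m in METHODS if m.within_pass]
cross_pass = [m.name for m in METHODS if not m.within_pass]

# Only methods with a reported AuditBench score can be compared directly.
compared = sorted(
    (m for m in METHODS if m.auditbench_pct is not None),
    key=lambda m: m.auditbench_pct,
    reverse=True,
)
```

Keeping `within_pass` as a field rather than splitting the list into two concepts mirrors the sub-category option; whether that is the right resolution is exactly the question the entry leaves open.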

The witness-ai thread retrofits the essay that uses this concept most directly. Its "Does Matter See Itself?" section carries the essay's argument that mechanistic access emerged untrained; its "Brilliant Servant" section holds the surface-report unreliability as a compatible but distinct observation. Concept vs. thread split: this concept does the bookkeeping; the thread makes the argument within the essay's four-section frame.

findings