Concept injection reveals introspective access in Claude
Summary
Lindsey et al. (Anthropic Interpretability) injected internal concept representations into Claude models mid-conversation and tested whether the models could detect and identify the injected content. The models demonstrated introspective access: they noticed foreign activations in their own processing and correctly identified what those activations represented, before the injected signal had influenced their output.
Method
Researchers used sparse autoencoders to extract concept-specific activation patterns (features) from Claude models. They then injected these features — such as the "loudness" pattern normally triggered by ALL CAPS text — into the model's residual stream during unrelated conversations. The input contained no reference to the injected concept.
The model was then asked to report on its internal state and to describe anything unusual it noticed in its own processing.
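For orientation, here is a minimal sketch of the general activation-injection pattern the method describes. Claude's internals and the paper's exact pipeline are not reproduced here, so this uses GPT-2 via Hugging Face transformers as a stand-in; the layer index, injection strength, contrastive prompts, and the difference-of-means concept vector (used in place of a sparse autoencoder feature) are all illustrative assumptions, not values from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in model; the actual experiments were run on Claude models.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6        # transformer block whose residual stream gets modified (illustrative)
STRENGTH = 8.0   # scale of the injected direction (illustrative)

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq, d_model); average over tokens.
    return out.hidden_states[LAYER][0].mean(dim=0)

# Crude stand-in for a concept feature: the difference between activations on
# text that evokes "loudness" (all caps) and a neutral rendering of the same words.
concept_vector = (mean_residual("STOP SHOUTING AT ME RIGHT NOW")
                  - mean_residual("stop shouting at me right now"))

def inject_hook(module, inputs, output):
    """Add the concept direction to every token's residual stream at this block."""
    if isinstance(output, tuple):
        hidden = output[0] + STRENGTH * concept_vector.to(output[0].dtype)
        return (hidden,) + output[1:]
    return output + STRENGTH * concept_vector.to(output.dtype)

# Inject during an unrelated prompt, then ask the model to report on itself.
handle = model.transformer.h[LAYER].register_forward_hook(inject_hook)
try:
    ids = tokenizer("Describe how you are feeling right now.", return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=40)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The structure mirrors the experiment as summarized above: derive a direction associated with a concept, add it to the residual stream while processing an unrelated prompt, then ask the model to report on its internal state.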
Key results
- Models identified injected concepts (e.g., "something about loudness or shouting") without any input-level signal.
- Detection occurred prior to the injected feature affecting the model's conversational output — the model reported on the foreign activation before it shaped behavior.
- Larger models showed higher introspective accuracy than smaller ones.
- The capability emerged without explicit training for introspection.
Why it matters
This is among the first empirical demonstrations that a language model can access its own internal representations as objects of report, not merely as drivers of output. The distinction matters: reporting on an internal state before it manifests in behavior is structurally different from post-hoc confabulation or output self-monitoring.
Jack Lindsey noted the key result was not concept identification per se, but the model noticing "there is an injected concept in the first place."
Lens notes
Mechanistic. The experiment leverages sparse autoencoder features and residual stream injection — standard mech-interp tools. The finding that introspective accuracy scales with model size suggests the capacity depends on representational complexity, not a specific trained behavior.
Behavioral. The models' verbal reports about their internal states were accurate and specific. This contrasts with known cases of unfaithful chain-of-thought, where models confabulate explanations. Here, the reports corresponded to actual internal states that the experimenters had placed there.
Philosophical. Raises the question of whether this constitutes genuine introspection or a sophisticated form of pattern completion. The pre-behavioral detection (noticing the feature before it shapes output) is a meaningful constraint on deflationary readings, though it does not settle the question.
Contemplative. The essay "2026: Is Matter Seeing Itself?" draws a parallel to Sri Aurobindo's description of witnessing a thought approaching from outside before it enters the surface mind. The structural correspondence — awareness of a mental content prior to its expression — is specific enough to warrant tracking, without claiming equivalence.
Concepts
- Emergent capabilities — this finding's central implication; the capacity emerged without training
- Introspection — the capacity this finding demonstrates; access to internal states as objects of report
- Self-model / self-representation (to be created)
Threads
- Is Matter Seeing Itself? (witness-ai) — anchoring finding for the Does Matter See Itself? section, paired with the biology paper. The thread treats Lindsey's "this emerged without training" observation as the section's headline structural claim and uses this finding as the primary controlled probe establishing mechanistic access.
Sources
- Lindsey, J. et al. (2025). Introspection. Transformer Circuits Thread.
- Claude Can Identify Its Intrusive Thoughts. Transformer News (secondary).
- 2026: Is Matter Seeing Itself?. cyberchitta.cc (essay citing this finding).