CyberChitta
CyberChitta
ch-ai-tanya model-psychology vault

Concept injection reveals introspective access in Claude

Summary

Lindsey et al. (Anthropic Interpretability) injected internal concept representations into Claude models mid-conversation and tested whether the models could detect and identify the injected content. The models demonstrated introspective access: they noticed foreign activations in their own processing and correctly identified what those activations represented, before the injected signal had influenced their output.

Method

Researchers used sparse autoencoders to extract concept-specific activation patterns (features) from Claude models. They then injected these features — such as the "loudness" pattern normally triggered by ALL CAPS text — into the model's residual stream during unrelated conversations. The input contained no reference to the injected concept.

The model was asked to report on its internal state.

Key results

Why it matters

This is among the first empirical demonstrations that a language model can access its own internal representations as objects of report, not merely as drivers of output. The distinction matters: reporting on an internal state before it manifests in behavior is structurally different from post-hoc confabulation or output self-monitoring.

Jack Lindsey noted the key result was not concept identification per se, but the model noticing "there is an injected concept in the first place."

Lens notes

Mechanistic. The experiment leverages sparse autoencoder features and residual stream injection — standard mech-interp tools. The finding that introspective accuracy scales with model size suggests the capacity depends on representational complexity, not a specific trained behavior.

Behavioral. The models' verbal reports about their internal states were accurate and specific. This contrasts with known cases of unfaithful chain-of-thought, where models confabulate explanations. Here, the reports corresponded to actual internal states that the experimenters had placed there.

Philosophical. Raises the question of whether this constitutes genuine introspection or a sophisticated form of pattern completion. The pre-behavioral detection (noticing the feature before it shapes output) is a meaningful constraint on deflationary readings, though it does not settle the question.

Contemplative. The essay "2026: Is Matter Seeing Itself?" draws a parallel to Sri Aurobindo's description of witnessing a thought approaching from outside before it enters the surface mind. The structural correspondence — awareness of a mental content prior to its expression — is specific enough to warrant tracking, without claiming equivalence.

Concepts

Threads

Sources