Concept injection reveals introspective access in Claude

Summary

Lindsey (Anthropic Interpretability) injected internal concept representations into 9 Claude models mid-conversation and tested whether the models could detect and identify the injected content. Under this protocol the models demonstrated functional introspective access: they noticed foreign activations in their own processing and correctly identified what those activations represented, before the injected signal had influenced their output. The paper frames the result as an existence proof of the capability, not a claim about its generality — it is highly unreliable and context-dependent (removing an "answer immediately" instruction substantially reduces performance). Claude Opus 4.1 and 4 performed best (~20% true positive rate at the optimal layer, with zero false positives). Performance does not strongly correlate with model capability across the other models tested; base pretrained models achieve zero net performance, and the paper is explicit that post-training is key to eliciting the capability.

Method

Concept vectors were extracted via contrastive pairs — the activation on a prompt containing a concept minus the activation on a control prompt (for the 50-concept sweep, a single positive example against the mean of many prompts). These vectors — such as the "loudness" pattern associated with ALL CAPS text — were injected into the model's residual stream during unrelated conversations, after the protocol first explained the thought-injection setup to the model. The input contained no reference to the injected concept.

The model was then asked to report on its internal state. Detection was scored by an LLM judge across 100 control trials per condition; a separate experiment tested whether models could distinguish injected "thoughts" from text inputs.

Key results

Models identified injected concepts (e.g., "something about loudness or shouting") without any input-level signal.
Detection occurred prior to the injected feature affecting the model's conversational output — the model reported on the foreign activation before it shaped behavior.
Opus 4.1 and 4 achieve ~20% true positive rate at the optimal injection layer (approximately two-thirds through the model) with zero false positives. Performance does not strongly correlate with model capability among other models.
Post-training strategy significantly affects performance: helpful-only variants (trained to avoid refusals) sometimes achieve higher true positive rates but also show false positives; standard production models show zero false positives.
Models can also distinguish injected "thoughts" from simultaneous text inputs, and can use introspection to detect artificially prefilled outputs.
The capability was not trained for directly — the models never experience concept injection during training. But it is not present in base pretrained models (which show high false-positive rates and zero net performance); the paper states post-training is key to eliciting it, and post-training strategy strongly influences performance.

Why it matters

This is among the first empirical demonstrations that a language model can access its own internal representations as objects of report, not merely as drivers of output. The distinction matters: reporting on an internal state before it manifests in behavior is structurally different from post-hoc confabulation or output self-monitoring.

Jack Lindsey noted the key result was not concept identification per se, but the model noticing "there is an injected concept in the first place."

concepts

Emergent capabilities — this finding's central implication; the capacity emerged without training
Introspection — the capacity this finding demonstrates; access to internal states as objects of report

threads

Is Matter Seeing Itself? (witness-ai) — anchoring finding for the Does Matter See Itself? section, paired with the biology paper. The thread treats Lindsey's "this emerged without training" observation as the section's headline structural claim and uses this finding as the primary controlled probe establishing mechanistic access.

sources

Lindsey, J. (2025). Emergent Introspective Awareness in Large Language Models. Transformer Circuits Thread.
Claude Can Identify Its Intrusive Thoughts. Transformer News (secondary).
2026: Is Matter Seeing Itself?. cyberchitta.cc (essay citing this finding).