
Concept injection reveals introspective access in Claude

draft
Tested on Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Haiku 3.5, Claude Opus 3, Claude Sonnet 3, Claude Haiku 3 · Oct 29, 2025

Summary

Jack Lindsey (Anthropic Interpretability) injected internal concept representations into nine Claude models mid-conversation and tested whether each model could detect and identify the injected content. The models showed a measure of introspective access: in a fraction of trials they noticed foreign activations in their own processing and correctly identified what those activations represented, before the injected signal had influenced their output. Claude Opus 4.1 and Claude Opus 4 performed best (~20% true positive rate at the optimal layer, with zero false positives); all models performed above chance.

Method

Researchers used sparse autoencoders to extract concept-specific activation patterns (features) from Claude models. They then injected these features — such as the "loudness" pattern normally triggered by ALL CAPS text — into the model's residual stream during unrelated conversations. The input contained no reference to the injected concept.
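The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: the function name `inject_concept`, the `strength` and `start_pos` parameters, the unit-normalization of the concept vector, and the toy three-dimensional activations are all assumptions made for the example. It treats the residual stream at one layer as a list of per-token vectors and adds a scaled concept direction to each position from `start_pos` onward.

```python
import math


def inject_concept(residual, concept_vector, strength=4.0, start_pos=0):
    """Return a copy of `residual` with a scaled concept direction added.

    residual:       per-token activation vectors at one layer (list of lists)
    concept_vector: direction standing in for an extracted concept feature
    strength:       hypothetical injection scale (an assumption, not a
                    value taken from the study)
    start_pos:      first token position that receives the injection
    """
    norm = math.sqrt(sum(x * x for x in concept_vector))
    if norm == 0:
        raise ValueError("concept_vector must be non-zero")
    # Normalizing is an assumption of this sketch, so that `strength`
    # alone controls the injection magnitude.
    unit = [x / norm for x in concept_vector]

    steered = []
    for pos, activation in enumerate(residual):
        if len(activation) != len(unit):
            raise ValueError("activation and concept dimensions differ")
        if pos >= start_pos:
            steered.append([a + strength * u for a, u in zip(activation, unit)])
        else:
            steered.append(list(activation))  # earlier positions unchanged
    return steered


# Toy example: two token positions, three hidden dimensions.
residual = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
concept = [3.0, 0.0, 4.0]
steered = inject_concept(residual, concept, strength=5.0, start_pos=1)
```

In this toy setup only the second position is modified, and the original `residual` list is left untouched, which makes it easy to compare injected and control runs on the same input.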

The model was then asked to report on its internal state: whether it detected an injected concept and, if so, what that concept was.
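One way to turn such self-reports into the true positive and false positive rates quoted in the summary is sketched below. The trial record layout (`injected`, `reported`), the function name `score_trials`, and the rule that a hit requires naming the injected concept exactly are assumptions for illustration; the study's actual grading procedure may differ. The sample trials are made-up placeholders, not data from the study.

```python
def score_trials(trials):
    """Compute (true_positive_rate, false_positive_rate) from trial records.

    Each trial is a dict with:
      injected: name of the injected concept, or None for a control trial
      reported: concept the model said it detected, or None if it reported
                nothing unusual
    """
    injected = [t for t in trials if t["injected"] is not None]
    control = [t for t in trials if t["injected"] is None]

    # Assumed hit rule: the model must name the concept that was injected.
    hits = sum(1 for t in injected if t["reported"] == t["injected"])
    # A false positive is any claimed detection on a trial with no injection.
    false_alarms = sum(1 for t in control if t["reported"] is not None)

    tpr = hits / len(injected) if injected else 0.0
    fpr = false_alarms / len(control) if control else 0.0
    return tpr, fpr


# Placeholder trials for illustration only.
trials = [
    {"injected": "loudness", "reported": "loudness"},
    {"injected": "loudness", "reported": None},
    {"injected": "ocean", "reported": None},
    {"injected": "ocean", "reported": "bread"},
    {"injected": "bread", "reported": None},
    {"injected": None, "reported": None},
    {"injected": None, "reported": None},
]
tpr, fpr = score_trials(trials)
```

Keeping control trials separate from injected ones is what lets a low detection rate still be meaningful: a model that never claims a detection on control trials is not simply guessing.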

Key results

Why it matters

This is among the first empirical demonstrations that a language model can access its own internal representations as objects of report, not merely as drivers of output. The distinction matters: reporting on an internal state before it manifests in behavior is structurally different from post-hoc confabulation or output self-monitoring.

Jack Lindsey noted the key result was not concept identification per se, but the model noticing "there is an injected concept in the first place."
