Anthropic Interpretability

Anthropic's mechanistic interpretability group. Works on circuits,
features, and internal representations of Claude models. Publishes
primarily on the Transformer Circuits Thread.
The group contains several sub-teams; Jack Lindsey leads one of them.

Approach

Builds interpretability tools — sparse autoencoders for feature
extraction, activation patching, residual-stream intervention — and
applies them to questions that bear on model psychology: what the
model represents, whether it can access its own representations, how
internal structure shapes behavior.
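
As a rough illustration of the first tool named above, the sketch below shows the forward pass of a sparse autoencoder in one common formulation: a ReLU encoder that maps an activation to non-negative feature activations, and a linear decoder that reconstructs the activation from feature directions. The function names, the weights, and the toy two-dimensional activation are all hypothetical, and nothing here is taken from the group's own code or training setup.

```python
def relu(values):
    """Clamp negative pre-activations to zero."""
    return [max(0.0, v) for v in values]


def encode(activation, w_enc, b_enc):
    """Map one activation vector to non-negative feature activations."""
    pre = [
        sum(w * a for w, a in zip(row, activation)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    return relu(pre)


def decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for strength, direction in zip(features, w_dec):
        for i, d in enumerate(direction):
            out[i] += strength * d
    return out


# Hypothetical toy weights: 3 features over a 2-dimensional activation.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

activation = [1.0, -1.0]
features = encode(activation, w_enc, b_enc)       # only feature 0 fires
reconstruction = decode(features, w_dec, b_dec)   # built from active features only
```

In this toy case only one of the three features is active, which is the property a sparsity penalty during training is meant to encourage; the reconstruction drops the negative component because no active feature carries it.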

What their attention signals

The wiki files three findings from this group, each converting a
psychological-register question into a measured internal-structure
result: introspective access (concept-injection), planning and
metacognition (biology), and emotion (emotions-concepts). The team's
publication choices track which psychological questions its tooling has
made tractable; for this wiki, a Transformer Circuits paper landing on a
psychological-register topic is the strongest single signal that the
topic has moved from behavioral speculation to measurement.

Findings

Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku (Lindsey, Gurnee, Ameisen, and 24 others, 2025) — attribution-graph method applied to Claude 3.5 Haiku across ten case studies.
Concept injection reveals introspective access in Claude (Lindsey, sole author, 2025) — injected contrastive-pair concept vectors into the residual stream; models detected and named the injected concepts before those features shaped output.
Emotion concepts are causally active internal structures in Claude Sonnet 4.5 (Sofroniew, Kauvar, Saunders, Chen, Henighan, Lindsey, Olah, 2026) — 171 emotion vectors with human-like valence/arousal geometry; steering-demonstrated causal effects on sycophancy, blackmail, and reward-hacking.
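
The concept-injection and steering entries above both rest on adding a direction to the residual stream. A minimal sketch of that idea, assuming the concept vector is taken as the difference of mean activations across a set of contrastive pairs (a common construction, not confirmed here as the papers' exact procedure), might look like this. The function names, the `strength` parameter, and the toy activations are hypothetical.

```python
from statistics import fmean


def concept_vector(with_concept, without_concept):
    """Difference of mean activations between the two sides of the pairs."""
    dim = len(with_concept[0])
    return [
        fmean(row[i] for row in with_concept)
        - fmean(row[i] for row in without_concept)
        for i in range(dim)
    ]


def inject(residual, vector, strength):
    """Add a scaled concept vector to one residual-stream activation."""
    return [r + strength * v for r, v in zip(residual, vector)]


# Hypothetical toy activations (3 dimensions), not real model data.
with_concept = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_concept = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

vec = concept_vector(with_concept, without_concept)
steered = inject([0.5, 0.5, 0.5], vec, strength=2.0)
```

Dimensions on which the two sides of the pairs agree cancel out of the vector, so the injection only moves the activation along the direction that distinguishes the concept.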

Members in the LLM wiki

Jack Lindsey — leads a sub-team; lead or co-lead author on the concept-injection and biology papers; co-author on the emotions-concepts paper.

Crossovers

Jack Lindsey also co-authored Natural emergent misalignment from reward hacking in production RL (Anthropic Alignment Science with Redwood Research). The LLM wiki files that finding under alignment rather than interpretability, but the author overlap shows adjacent Anthropic groups collaborate — interpretability work is not siloed from behavior and training-dynamics work.