ch-ai-tanya model-psychology LLM wiki

Anthropic's mechanistic interpretability group. Works on circuits, features, and internal representations of Claude models. Publishes primarily on the Transformer Circuits Thread. The group contains several sub-teams; Jack Lindsey leads one of them.

Approach

Builds interpretability tools — sparse autoencoders for feature extraction, activation patching, residual-stream intervention — and applies them to questions that bear on model psychology: what the model represents, whether it can access its own representations, how internal structure shapes behavior.
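
As a rough illustration of activation patching, one of the tools listed above, here is a minimal sketch on a toy two-layer network in plain Python. The network, its weights, and the function names (`run_toy_model`, `patching_effects`) are hypothetical and chosen only for illustration; this is not the group's code or tooling, and real work targets transformer activations rather than a hand-built network.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def run_toy_model(
    x: Vector,
    w_in: Matrix,
    w_out: Matrix,
    patch: Optional[Dict[int, float]] = None,
) -> Tuple[float, Vector]:
    """Hypothetical toy model: hidden = ReLU(w_in @ x), output = w_out @ hidden.

    `patch` maps a hidden-unit index to a value that overwrites that
    unit's activation before the readout is applied.
    """
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    return matvec(w_out, hidden)[0], hidden


def patching_effects(
    clean: Vector, corrupted: Vector, w_in: Matrix, w_out: Matrix
) -> List[float]:
    """For each hidden unit, copy its clean-run activation into the
    corrupted run and report how far the output moves from the
    unpatched corrupted output."""
    _, clean_hidden = run_toy_model(clean, w_in, w_out)
    corrupted_out, _ = run_toy_model(corrupted, w_in, w_out)
    effects = []
    for idx, clean_value in enumerate(clean_hidden):
        patched_out, _ = run_toy_model(corrupted, w_in, w_out, {idx: clean_value})
        effects.append(patched_out - corrupted_out)
    return effects


# Illustrative weights and inputs (assumptions, not taken from any real model).
W_IN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [[2.0, -1.0]]
CLEAN = [1.0, 0.0]
CORRUPTED = [0.0, 1.0]

print(patching_effects(CLEAN, CORRUPTED, W_IN, W_OUT))  # [2.0, 1.0]
```

In this toy the readout is linear, so the per-unit effects sum to the full gap between the clean output (2.0) and the corrupted output (-1.0). In a real model the effects of separate patches need not add up this way.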

In-wiki findings

Members in the LLM wiki

Crossovers

Jack Lindsey also co-authored "Natural emergent misalignment from reward hacking in production RL" (Anthropic Alignment Science with Redwood Research). The LLM wiki files that finding under alignment rather than interpretability, but the author overlap shows that adjacent Anthropic groups collaborate: interpretability work is not siloed from behavior and training-dynamics work.