CyberChitta
ch-ai-tanya model-psychology vault

Anthropic Interpretability

Anthropic's mechanistic interpretability group. Works on circuits, features, and internal representations of Claude models. Publishes primarily on the Transformer Circuits Thread. The group contains several sub-teams; Jack Lindsey leads one of them.

Approach

Builds interpretability tools — sparse autoencoders for feature extraction, activation patching, residual-stream intervention — and applies them to questions that bear on model psychology: what the model represents, whether it can access its own representations, how internal structure shapes behavior.
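As a rough illustration of one tool named above, activation patching can be sketched on a toy network. Everything here is an assumption made for illustration: the two-layer model, its hand-picked weights, and the helper names run and patching_effect are hypothetical and are not taken from the group's code or papers. The idea is to cache a hidden activation from a "clean" run, overwrite the same unit during a "corrupted" run, and see how much of the clean output comes back.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Hypothetical toy weights, chosen by hand so the arithmetic is easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]  # input -> hidden
W2 = [[1.0, -1.0]]             # hidden -> scalar output


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def matvec(m: List[Vector], v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def run(x: Vector,
        patch: Optional[Dict[int, float]] = None,
        cache: Optional[Dict[str, Vector]] = None) -> float:
    """Run the toy model, optionally caching or overwriting hidden units."""
    hidden = relu(matvec(W1, x))
    if cache is not None:
        cache["hidden"] = list(hidden)  # record activations before any patch
    if patch is not None:
        for unit, value in patch.items():
            hidden[unit] = value        # the intervention itself
    return matvec(W2, hidden)[0]


def patching_effect(clean: Vector, corrupted: Vector) -> Dict[int, float]:
    """For each hidden unit, patch in its clean value and measure the output shift."""
    clean_cache: Dict[str, Vector] = {}
    run(clean, cache=clean_cache)
    baseline = run(corrupted)
    effects = {}
    for unit, clean_value in enumerate(clean_cache["hidden"]):
        patched = run(corrupted, patch={unit: clean_value})
        effects[unit] = patched - baseline
    return effects


clean_input = [2.0, 0.5]
corrupted_input = [0.0, 0.5]  # differs from the clean input only in the first coordinate

clean_out = run(clean_input)
corrupt_out = run(corrupted_input)
patched_effects = patching_effect(clean_input, corrupted_input)
print(clean_out, corrupt_out, patched_effects)
```

In this toy case only hidden unit 0 carries the difference between the two inputs, so patching it restores the clean output while patching unit 1 changes nothing. Real applications patch residual-stream or feature activations inside a transformer, which this sketch does not attempt.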

In-vault findings

Members in the vault

Crossovers

Jack Lindsey also co-authored "Natural emergent misalignment from reward hacking in production RL" (Anthropic Alignment Science with Redwood Research). The vault files that finding under alignment rather than interpretability, but the author overlap shows adjacent Anthropic groups collaborate: interpretability work is not siloed from behavior and training-dynamics work.