
Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku

draft
tested on Claude 3.5 Haiku · Mar 27, 2025

Summary

Lindsey (lead contributor), Batson (senior author), and 25 collaborators at Anthropic Interpretability apply attribution-graph analysis to Claude 3.5 Haiku across ten case studies. The method traces causal pathways through a "local replacement model" built from 30M cross-layer transcoder features; perturbation experiments validate graph predictions. Several case studies surface psychologically loaded mechanisms in a single model: forward planning in rhymed poetry, multi-step reasoning with intermediate-concept representations, language-independent semantic operations, arithmetic via lookup tables decoupled from the model's verbal explanation, metacognitive circuits that represent the extent of the model's own knowledge, and hidden-goal representation inside a deliberately misaligned fine-tune.

Method

Attribution graphs. Nodes are features from a cross-layer transcoder (interpretable pseudo-neurons). Edges are causal interactions derived from a replacement model that swaps the original MLP activations for transcoder features while holding the attention patterns fixed. The graph for a given prompt shows which features feed into which, from input to output.
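The node-and-edge structure described above can be sketched with a toy linear graph. This is only an illustration, not the paper's implementation: it assumes each edge weight is the source node's activation times a fixed linear weight, and every node name, activation, and weight below is invented.

```python
def build_attribution_graph(activations, weights, threshold=0.0):
    """Return {(src, dst): attribution} for a toy linear graph.

    attribution = activation of the source node * linear weight to the target.
    Edges whose absolute attribution is at or below `threshold` are pruned.
    """
    edges = {}
    for (src, dst), weight in weights.items():
        attribution = activations.get(src, 0.0) * weight
        if abs(attribution) > threshold:
            edges[(src, dst)] = attribution
    return edges


def incoming(edges, node):
    """List (source, attribution) pairs feeding `node`, strongest first."""
    pairs = [(src, a) for (src, dst), a in edges.items() if dst == node]
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


# Invented activations and weights; node names are illustrative only.
activations = {"tok:Dallas": 1.0, "feat:Texas": 0.5, "feat:say-a-capital": 0.25}
weights = {
    ("tok:Dallas", "feat:Texas"): 0.5,
    ("feat:Texas", "logit:Austin"): 2.0,
    ("feat:say-a-capital", "logit:Austin"): 2.0,
    ("tok:Dallas", "logit:Austin"): 0.125,
}

graph = build_attribution_graph(activations, weights, threshold=0.2)
for src, attribution in incoming(graph, "logit:Austin"):
    print(f"{src} -> logit:Austin: {attribution:+.2f}")
```

Pruning by a threshold is a design choice in this sketch: it keeps the graph readable by dropping weak edges (here the direct token-to-logit edge), at the cost of hiding small contributions.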

Validation. Feature groups predicted to be causal can be inhibited or replaced, and the downstream effect checked against the graph's prediction. Where predictions and perturbations agree, the graph is treated as a hypothesis about what the original model is doing.
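The validation loop can be sketched on a toy model as follows. This is a minimal illustration under stated assumptions: the two features (`f_plan`, `f_other`), their weights, and the tolerance are hypothetical, and because the toy readout is linear the prediction and the measurement agree by construction. In a real model, agreement is an empirical question.

```python
# Hypothetical output weights for two invented features.
OUT_WEIGHTS = {"f_plan": 1.5, "f_other": -1.0}


def forward(x, scale=None):
    """Toy model: two ReLU features feeding a linear readout.

    `scale` multiplies named feature activations (1.0 = untouched,
    0.0 = fully inhibited).
    """
    scale = scale or {}
    feats = {
        "f_plan": max(0.0, 2.0 * x) * scale.get("f_plan", 1.0),
        "f_other": max(0.0, 0.5 * x) * scale.get("f_other", 1.0),
    }
    out = sum(OUT_WEIGHTS[name] * act for name, act in feats.items())
    return feats, out


def check_prediction(x, feature, tol=1e-9):
    """Compare the graph-style prediction with a measured inhibition effect."""
    feats, base_out = forward(x)
    # Graph prediction: removing the feature removes its direct attribution.
    predicted = -(feats[feature] * OUT_WEIGHTS[feature])
    # Perturbation: inhibit the feature and rerun the model.
    _, perturbed_out = forward(x, scale={feature: 0.0})
    measured = perturbed_out - base_out
    return predicted, measured, abs(predicted - measured) <= tol


predicted, measured, agrees = check_prediction(1.0, "f_plan")
print(predicted, measured, agrees)
```

When the two numbers diverge beyond the tolerance, the graph's account of that feature should be treated as unsupported rather than confirmed.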

Scope. Ten case studies. The authors report that the method gives "satisfying insight" on about a quarter of the prompts attempted.

Key case studies

Why it matters

Six of the ten case studies bear directly on concepts the LLM wiki already tracks:

The paper also raises a candidate concept the LLM wiki does not yet name: forward planning / goal representation as distinct from introspective access. The poetry case is one data point. A second independent demonstration would warrant drafting an entry for it.

Interpretive tensions
