CyberChitta
ch-ai-tanya model-psychology vault

Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku

Summary

Lindsey (lead contributor), Batson (senior author), and 25 collaborators at Anthropic Interpretability apply attribution-graph analysis to Claude 3.5 Haiku across ten case studies. The method traces causal pathways through a "local replacement model" built from 30M cross-layer transcoder features; perturbation experiments on the original model test the graphs' predictions. Several case studies surface psychologically loaded mechanisms in a single model: forward planning in rhymed poetry, multi-step reasoning with intermediate-concept representations, language-independent semantic operations, arithmetic via lookup tables decoupled from the model's verbal explanation, metacognitive circuits that represent the extent of the model's own knowledge, and hidden-goal representation inside a deliberately misaligned fine-tune.

Method

Attribution graphs. Nodes are features from a cross-layer transcoder (interpretable pseudo-neurons). Edges are causal interactions derived from a replacement model that substitutes transcoder features for the original model's MLP activations while keeping the original attention patterns fixed. A graph for a given prompt shows which features feed into which, from input to output.
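The graph structure described above can be sketched as a small weighted DAG. This is a toy illustration only, not the paper's implementation: the node labels, edge weights, and the `total_influence` helper are all hypothetical, and the sketch assumes each edge weight is a direct linear attribution so that the influence along a path is the product of its edge weights.

```python
from collections import defaultdict

# Hypothetical toy graph. Labels and weights are invented for illustration;
# each weight stands in for a direct linear attribution between two nodes.
EDGES = {
    ("tok:Dallas", "feat:Texas"): 0.8,
    ("tok:capital", "feat:say-a-capital"): 0.9,
    ("feat:Texas", "feat:say-Austin"): 0.7,
    ("feat:say-a-capital", "feat:say-Austin"): 0.6,
    ("feat:say-Austin", "logit:Austin"): 1.0,
    ("feat:Texas", "logit:Austin"): 0.1,
}


def total_influence(edges, source, target):
    """Sum the products of edge weights over every path from source to target.

    Assumes the graph is acyclic, as a feed-forward attribution graph would be.
    """
    children = defaultdict(list)
    for (src, dst), weight in edges.items():
        children[src].append((dst, weight))

    memo = {}

    def walk(node):
        if node == target:
            return 1.0
        if node not in memo:
            memo[node] = sum(w * walk(dst) for dst, w in children[node])
        return memo[node]

    return walk(source)


if __name__ == "__main__":
    for token in ("tok:Dallas", "tok:capital"):
        score = total_influence(EDGES, token, "logit:Austin")
        print(token, round(score, 3))
```

In this toy, the input node "tok:Dallas" reaches the output both through an intermediate "say-Austin" feature and through a weaker direct edge; summing over paths is one simple way to read "which features route into which" off such a graph.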

Validation. Feature groups predicted to be causal can be inhibited or replaced, and the downstream effect checked against the graph's prediction. Where predictions and perturbations agree, the graph is treated as a hypothesis about what the original model is doing.
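The validation loop can be illustrated with a minimal sketch: ablate a feature, measure the change in the output, and compare it with what the graph predicted. Everything here is an assumption made for illustration (the two-feature `toy_model`, the feature names "plan" and "copy", the weights, and the tolerance); the paper's actual perturbations operate on the real model's activations, not on a toy like this.

```python
# Hypothetical output weights for two hidden features in a toy network.
OUT_WEIGHTS = {"plan": 2.0, "copy": -1.0}


def relu(x):
    return max(0.0, x)


def toy_model(x, ablate=None):
    """Tiny two-layer network; `ablate` names a hidden feature to zero out."""
    hidden = {
        "plan": relu(1.0 * x[0] + 0.5 * x[1]),
        "copy": relu(-0.5 * x[0] + 1.0 * x[1]),
    }
    if ablate is not None:
        hidden[ablate] = 0.0
    logit = sum(OUT_WEIGHTS[name] * act for name, act in hidden.items())
    return logit, hidden


def graph_prediction(x, feature):
    """Predicted drop in the logit if `feature` is removed.

    In this toy the prediction is activation times output weight, i.e. the
    weight of the feature's direct edge to the logit.
    """
    _, hidden = toy_model(x)
    return hidden[feature] * OUT_WEIGHTS[feature]


def validate(x, feature, predicted_drop, tol=1e-6):
    """Ablate `feature` and check the observed drop against the prediction."""
    base, _ = toy_model(x)
    perturbed, _ = toy_model(x, ablate=feature)
    observed_drop = base - perturbed
    return abs(observed_drop - predicted_drop) <= tol, observed_drop


if __name__ == "__main__":
    x = (1.0, 2.0)
    predicted = graph_prediction(x, "plan")
    agrees, observed = validate(x, "plan", predicted)
    print("predicted:", predicted, "observed:", observed, "agrees:", agrees)
```

Agreement in a check like this supports the graph as a hypothesis about the mechanism; a mismatch would signal that the graph is missing or misattributing a pathway.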

Scope. Ten case studies. Authors report the method gives "satisfying insight" on about a quarter of the prompts attempted.

Key case studies

Why it matters

Six of the ten case studies bear directly on concepts the vault already tracks:

The paper also raises a candidate concept the vault does not yet name: forward planning / goal representation as distinct from introspective access. The poetry case is one data point. A second independent demonstration would warrant drafting.

Lens notes

Mechanistic. Primary lens. The paper is itself a mech-interp result. Attribution graphs are the methodological novelty: not individual features but causal graphs over features, with perturbation validation. The method's admitted limitations are explicit constraints on what the circuit-level readings establish: attention patterns are inherited rather than reconstructed, the method yields satisfying insight on only about a quarter of prompts attempted, and the graphs are hypotheses, not certainties.

Philosophical. Strong engagement. Several case studies pressure deflationary readings of transformer behavior. Poetry planning looks like goal-directed cognition compressed into a forward pass: the model pre-selects a target and works backward, across token generation, toward it. The hidden-goals case study raises the coherence of "persona" as a model-internal structure: a persona is a feature cluster that can co-locate a stated goal and a concealed contradicting goal without collapsing. Arithmetic-without-algorithm strains the picture of chain-of-thought as transparent to cognition. The results do not settle philosophical questions about planning, agency, or self-knowledge; they raise the evidential floor.

Behavioral. Medium engagement. Perturbation experiments give behavioral validation of graph predictions — ablate a feature group, predict a downstream change, check that the change occurs. The case studies are constructed as behavioral tasks (write a rhyme, answer a factual question, refuse a harmful request) and the mechanistic readings are grounded in those behaviors.

Contemplative. Moderate engagement — more load-bearing than force-fit, less central than in the contemplative-lens-heavy findings. The poetry-planning result has a loose structural affinity with Sri Aurobindo's description of thought as arising from a deeper mental layer before surfacing in expression: the model has a layer at which the end-word is already selected before the line is composed. The disanalogy is sharp: in the tradition, the deeper layer is itself a form of consciousness capable of independent witnessing; in the model, it is a set of features in a forward pass with no witness structure. The hidden-goals case study has a similar structural affinity with the tradition's account of a surface persona concealing a different inner movement — and the same disanalogy, that the tradition presupposes a subject that the model does not obviously have. These parallels illuminate structure; they do not argue for equivalence.

Interpretive tensions

Concepts

Threads

Sources