On the Biology of a Large Language Model

Lindsey (lead contributor), Batson (senior author), and 25 collaborators at Anthropic Interpretability apply attribution-graph analysis to Claude 3.5 Haiku across ten case studies. Attribution graphs trace causal pathways through a "local replacement model" built from 30M cross-layer transcoder features; perturbation experiments validate the graphs' predictions. The major case studies surface psychologically loaded mechanisms: forward planning in rhymed poetry (end-word features activate before the line is written), multi-step reasoning with intermediate representations (Dallas → Texas → Austin), language-independent semantic operations with English as a mechanistic default, lookup-table arithmetic decoupled from the model's stated algorithm, metacognitive circuits that represent the extent of the model's own knowledge, hallucination as inhibitory failure between known-entity and can't-answer features, refusal as a two-layer wiring of pretrained harm-specific features into a fine-tuned abstract harm concept, and hidden-goal representation inside a misaligned-model variant. The authors explicitly frame circuit behavior as "biology-like": structural and evolved, not designed. Acknowledged limitations: attention patterns are inherited from the original model rather than reconstructed, leaving significant computation invisible; the methods give satisfying insight on roughly a quarter of the prompts tried; and attribution graphs yield hypotheses, not certainties.
