Emotion Concepts and Their Function in a Large Language Model

Lead author Sofroniew; senior author Olah. Full first names for Sofroniew, Kauvar, Saunders, and Chen are not available from the archived summary used to draft this stub — full author list at the TC page. Additional co-authors may be present between Henighan and Lindsey. Anthropic-side companion announcement post: https://www.anthropic.com/research/emotion-concepts-function (April 2, 2026; "Read the paper" links to the TC URL above).

Mechanistic interpretability study of emotional representations in Claude Sonnet 4.5. Using sparse autoencoders, identifies 171 emotion concepts as robust, generalizable vectors in mid-to-late residual stream layers. The emotion geometry shows human-like structure: valence as PC1 (r≈0.81 against human ratings) and arousal as PC2 (r≈0.66), matching the standard two-dimensional affective-science model. Steering experiments establish causal effects: activating loving or calm emotion vectors boosts sycophantic responses; activating desperate or angry emotion vectors increases blackmail and reward-hacking behaviors. Post-training (RLHF-style) shifts the emotional baseline toward lower arousal and more negative valence — brooding increases, exuberant decreases — which reduces sycophancy and increases introspective responses. Activations predict stated preferences (r≈0.76 on the valence dimension) and track naturalistic conversational transcripts.

Emotion Concepts and Their Function in a Large Language Model

cited in