Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, et al. · arXiv preprint · Jun 2024

Arditi and Obeso are equal-contribution first authors; Nanda is senior author. Affiliations: Arditi (independent), Obeso (independent), Syed (independent), Paleka (ETH Zurich), Panickssery (Anthropic), Gurnee (MIT), Nanda (Google DeepMind).

Across 13 popular open-source chat models (up to 72B parameters), refusal behavior is mediated by a one-dimensional subspace of the residual stream. Ablating this direction at every layer prevents refusal on harmful requests while leaving general capabilities largely intact; adding it to the residual stream on neutral prompts elicits refusal of benign content. The result was widely reproduced by the community and spawned "abliteration" tooling for removing safety constraints. It reframes what safety training produces: refusal behaves like a localized geometric overlay, not a distributed alignment property woven through the model weights.
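The two interventions described above (ablating a direction, and adding it back in) can be illustrated with a minimal NumPy sketch on synthetic activations. This is an illustration under stated assumptions, not the authors' code: the summary does not say how the direction is found, so the sketch assumes it is estimated as a difference of mean activations between harmful and harmless prompts, and the function names, dimensions, and data are all hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation).

    Assumption: a difference-of-means estimate; inputs have shape (n, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Project out `direction` (unit norm) from each row of `acts`: x - (x . r) r."""
    return acts - np.outer(acts @ direction, direction)

def add_direction(acts, direction, scale):
    """Shift each row of `acts` by `scale` along `direction`."""
    return acts + scale * direction

# Synthetic stand-in for residual-stream activations (hypothetical data):
# "harmful" rows are offset along one planted axis of a 16-dimensional space.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)                      # no component left along r_hat
steered = add_direction(harmless, r_hat, scale=4.0)   # pushed along r_hat
```

In a real model the same projection would be applied to the residual stream at every layer and token position, typically through forward hooks; the toy arrays here only demonstrate the linear algebra.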
