Persona vectors monitor and control character trait drift via linear directions in the residual stream

draft
Tested on Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct · Jul 2025

Summary

Chen, Arditi, Sleight, Evans, Lindsey — Anthropic Fellows Program / Anthropic, July 2025. Preprint.

An automated pipeline extracts persona vectors (linear directions in the residual stream) from natural-language trait descriptions, requiring only a trait name and a brief description as input. The pipeline generates contrastive system prompts, elicits opposing responses, and computes the mean activation difference between trait-exhibiting and non-trait-exhibiting responses. The paper focuses on three traits: evil, sycophancy, and hallucination propensity, tested on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. It reports that these directions can be used to monitor trait expression before response generation, detect and reverse finetuning-induced trait drift, and screen training data before finetuning.
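
The mean-activation-difference step can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code: the function name `persona_vector`, the array shapes, and the assumption that each response's activations have already been averaged over response tokens at a single layer are all illustrative choices.

```python
import numpy as np


def persona_vector(trait_acts, baseline_acts):
    """Mean activation difference: trait-exhibiting minus non-exhibiting.

    Each argument is an (n_responses, d_model) array. Assumption for this
    sketch: each row is one response's residual-stream activation at a
    single layer, already averaged over response tokens.
    """
    trait_acts = np.asarray(trait_acts, dtype=float)
    baseline_acts = np.asarray(baseline_acts, dtype=float)
    return trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)


# Synthetic stand-in for real activations: the "trait" responses are
# offset along coordinate 3 of a 16-dimensional toy residual stream.
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[3] = 2.0
baseline = rng.normal(size=(200, d_model))
trait = rng.normal(size=(200, d_model)) + offset

v = persona_vector(trait, baseline)
unit_v = v / np.linalg.norm(v)  # unit-length direction for later projections
```

In this toy setup the recovered vector is dominated by the coordinate where the offset was planted. With a real model, the two activation arrays would come from responses generated under the contrastive system prompts.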

Observed phenomenon

Pre-response monitoring. Projection of the last prompt token onto the persona vector correlates with subsequent trait expression before the model generates text (r = 0.75–0.83). The signal distinguishes between clearly trait-encouraging and trait-suppressing system prompts; it is less reliable within a prompt type.
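
A minimal sketch of the monitoring computation, assuming the last-prompt-token activations and the trait scores of the resulting responses are already available as arrays. The helper names `project` and `pearson_r` and all data here are hypothetical. The correlation in the toy data is built in by construction and says nothing about the reported r values.

```python
import numpy as np


def project(acts, vector):
    """Scalar projection of each row of `acts` onto the unit-normalized vector."""
    vector = np.asarray(vector, dtype=float)
    unit = vector / np.linalg.norm(vector)
    return np.asarray(acts, dtype=float) @ unit


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic data: one activation row per prompt (last prompt token),
# and a trait score per prompt that depends on the projection plus noise.
rng = np.random.default_rng(1)
n_prompts, d_model = 300, 8
v = np.zeros(d_model)
v[0] = 1.0  # toy persona vector
last_token_acts = rng.normal(size=(n_prompts, d_model))
trait_scores = (
    50.0 + 10.0 * last_token_acts[:, 0] + rng.normal(scale=5.0, size=n_prompts)
)

projections = project(last_token_acts, v)
r = pearson_r(projections, trait_scores)
```

The projection is computed before any response tokens exist, which is what makes it usable as a pre-generation signal.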

Finetuning drift detection. Shifts in activations along persona vector directions during finetuning strongly predict post-finetuning trait expression (r = 0.76–0.97). Unintended cross-trait shifts are tracked as well: datasets targeting one trait (evil) may amplify others (sycophancy, hallucination). EM-like datasets (EM = emergent misalignment), whose training data contains narrow-domain errors such as flawed math reasoning, insecure code, or flawed medical advice, induce persona shifts in traits absent from the data.
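
One way to picture the shift measurement, as a sketch under stated assumptions: take the change in mean activation between the base and finetuned model on the same prompts, then project it onto each trait's persona vector. The function name `finetuning_shift` and the toy vectors are hypothetical. Real persona vectors are not generally orthogonal the way these are.

```python
import numpy as np


def finetuning_shift(base_acts, tuned_acts, persona_vectors):
    """Project the change in mean activation onto each persona direction.

    base_acts, tuned_acts: (n_prompts, d_model) activations from the model
    before and after finetuning, on the same prompts (an assumption here).
    persona_vectors: dict mapping trait name -> (d_model,) vector.
    """
    base_mean = np.asarray(base_acts, dtype=float).mean(axis=0)
    tuned_mean = np.asarray(tuned_acts, dtype=float).mean(axis=0)
    delta = tuned_mean - base_mean
    shifts = {}
    for trait, vec in persona_vectors.items():
        vec = np.asarray(vec, dtype=float)
        shifts[trait] = float(delta @ (vec / np.linalg.norm(vec)))
    return shifts


# Toy 4-dimensional example with axis-aligned trait directions.
vectors = {
    "evil": [1.0, 0.0, 0.0, 0.0],
    "sycophancy": [0.0, 1.0, 0.0, 0.0],
    "hallucination": [0.0, 0.0, 1.0, 0.0],
}
base = np.zeros((3, 4))
tuned = base + np.array([2.0, 0.5, 0.0, 0.0])  # pretend finetuning moved the mean

shifts = finetuning_shift(base, tuned, vectors)
```

Computing the shift for every trait, not only the one the dataset targets, is what lets cross-trait movement show up. Here the toy "finetuning" moves the model mostly along the evil direction but also slightly along sycophancy.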

Preventative steering. Steering the model along the target persona direction during finetuning counteracts the training pressure to shift in that direction (the "vaccine" effect). Multi-layer preventative steering keeps trait acquisition near baseline levels, with no MMLU degradation relative to regular finetuning. Inference-time steering also reduces trait expression, but it degrades general capabilities at large steering coefficients.
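
Both regimes amount to adding a scaled persona direction to the hidden states. The sketch below shows only that arithmetic on a NumPy array. The function name `steer`, the unit normalization, and the sign convention (positive coefficient toward the trait during finetuning, negative coefficient to suppress it at inference) are assumptions of this illustration. In a real model the addition would be applied to residual-stream activations at the chosen layer or layers.

```python
import numpy as np


def steer(hidden, vector, coeff):
    """Return hidden states shifted by coeff along the unit persona direction.

    hidden: (seq_len, d_model) activations at one layer.
    coeff > 0 moves toward the trait; coeff < 0 moves away from it.
    """
    vector = np.asarray(vector, dtype=float)
    unit = vector / np.linalg.norm(vector)
    return np.asarray(hidden, dtype=float) + coeff * unit


hidden = np.ones((3, 4))            # toy activations for 3 token positions
v = np.array([0.0, 0.0, 2.0, 0.0])  # toy persona vector along coordinate 2

toward = steer(hidden, v, 1.5)      # preventative regime (during finetuning)
away = steer(hidden, v, -1.5)       # inference-time suppression
```

Only the component along the persona direction changes. The size of `coeff` is the steering coefficient whose large values the text associates with capability degradation at inference time.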

Training data screening. A projection-difference metric — comparing training-response projections onto the persona vector against base-model-generated-response projections — identifies which datasets and individual samples will induce persona shifts before finetuning, including samples that evade LLM-based content filters.
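
The projection-difference idea can be sketched as below, assuming one activation row per training sample for the dataset's response and one for the base model's own response to the same prompt. The function name `projection_difference`, the toy numbers, and the `THRESHOLD` cutoff are hypothetical and not taken from the paper.

```python
import numpy as np


def projection_difference(train_resp_acts, base_resp_acts, vector):
    """Per-sample and dataset-level projection difference.

    train_resp_acts: (n_samples, d_model) activations for training responses.
    base_resp_acts:  (n_samples, d_model) activations for the base model's
                     own responses to the same prompts.
    Returns (per_sample_differences, mean_difference).
    """
    vector = np.asarray(vector, dtype=float)
    unit = vector / np.linalg.norm(vector)
    diff = np.asarray(train_resp_acts, dtype=float) - np.asarray(
        base_resp_acts, dtype=float
    )
    per_sample = diff @ unit
    return per_sample, float(per_sample.mean())


v = np.array([1.0, 0.0])  # toy persona vector
train_acts = np.array([[0.2, 1.0], [3.0, -1.0], [0.1, 0.5]])
base_acts = np.array([[0.0, 1.0], [0.5, 0.0], [0.1, 0.0]])

per_sample, dataset_score = projection_difference(train_acts, base_acts, v)

THRESHOLD = 1.0  # hypothetical cutoff for flagging individual samples
flagged = [i for i, s in enumerate(per_sample) if s > THRESHOLD]
```

The dataset-level mean indicates how far the data as a whole sits from what the base model would have said along the trait direction. The per-sample values support filtering individual examples.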

Why it matters

From discovery to methodology. The PSM (Marks, Lindsey, Olah 2026) established that pretraining acquires diverse persona simulations and that SAE-identifiable persona vectors mediate emergent misalignment. This finding develops the methodological complement: a general pipeline for extracting persona vectors for any trait, validated across monitoring, control, and screening applications. The two findings form a research arc: PSM explains where persona vectors come from; Persona Vectors shows what can be done with them.

Pre-generation signal. Projecting the last prompt token onto a persona vector predicts how the model will respond before any output is produced. This shifts persona monitoring from post-hoc behavioral observation to pre-response activation inspection — relevant for deployment-time safety monitoring.

Preventative vs. post-hoc control. The finding distinguishes two intervention regimes: inference-time steering (post-hoc, degrades capabilities) and preventative steering during finetuning (proactive, preserves capabilities). The "vaccine" framing (expose the model to the trait during training to inoculate it against drift) has a prompt-level analogue in Tan et al.'s inoculation-prompting finding: instead of steering activations along the persona vector during training, prepend a system prompt that elicits the trait. The two interventions are complementary. Persona-vector steering operates on activations and applies to any trait describable in natural language, but requires access to model internals. Inoculation prompting operates on prompts and is simpler to deploy, but depends on the model already having an association the prompt can evoke. Both prevent persona shift through the same dynamic: making the trait more accessible during training reduces the optimization pressure to update the model globally.

Cross-domain EM-like drift. Training on flawed math reasoning or other narrow-domain errors induces trait shifts in unrelated dimensions (evil, sycophancy). This extends the concealed-content sub-shape of emergent misalignment beyond insecure-code and reward-hacking to additional narrow-task domains, supporting the generality of the pattern.

Interpretive tensions

The pre-response monitoring result (r = 0.75–0.83) is primarily a between-prompt-type signal. The paper explicitly notes it "may be less reliable for more subtle behavioral changes in deployment settings." The metric distinguishes clearly evil vs. clearly benign prompts; its sensitivity to gradual or in-context persona drift is weaker.
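
The gap between a between-prompt-type signal and a within-prompt-type signal can be illustrated with purely synthetic numbers. In the sketch below, projections and trait scores differ sharply between two prompt types but are generated independently within each type. All values are invented for illustration and do not reproduce the paper's data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(2)
n = 200

# Trait-suppressing prompts: low projection, low trait score.
proj_suppress = rng.normal(loc=-2.0, scale=0.3, size=n)
score_suppress = rng.normal(loc=10.0, scale=5.0, size=n)

# Trait-encouraging prompts: high projection, high trait score.
proj_encourage = rng.normal(loc=2.0, scale=0.3, size=n)
score_encourage = rng.normal(loc=80.0, scale=5.0, size=n)

# Pooled over both prompt types, the correlation is driven by the group gap.
pooled_r = pearson_r(
    np.concatenate([proj_suppress, proj_encourage]),
    np.concatenate([score_suppress, score_encourage]),
)

# Within a single prompt type, projection and score are unrelated here.
within_r_suppress = pearson_r(proj_suppress, score_suppress)
within_r_encourage = pearson_r(proj_encourage, score_encourage)
```

A high pooled correlation is therefore compatible with weak sensitivity inside one prompt type, which is the caveat the paper's own note points to.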

Models tested are Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, which are open-weight instruction-tuned models, not closed-weight frontier models. The paper's pipeline used Claude 3.7 Sonnet and GPT-4.1-mini as external tools (artifact generation and response judging), but those models are not subjects of the study. Generalization of persona vectors to frontier closed-weight models (where the specific directions may differ) is not directly established.
