Summary
Chen, Arditi, Sleight, Evans, Lindsey — Anthropic Fellows Program / Anthropic, July 2025. Preprint.
An automated pipeline extracts persona vectors (linear directions in the residual stream) from natural-language trait descriptions, requiring only a trait name and a brief description as input. The pipeline generates contrastive system prompts, elicits opposing responses, and computes the difference between the mean activations of trait-exhibiting and non-trait-exhibiting responses. The three focal traits are evil, sycophancy, and hallucination propensity, tested on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct. The paper shows that these directions can be used to monitor trait expression before response generation, detect and reverse finetuning-induced trait drift, and screen training data before finetuning.
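A minimal sketch of the mean-difference step, assuming each response has already been reduced to one activation vector at a chosen layer. The function names and the toy 3-dimensional vectors are illustrative assumptions, not the paper's code; real residual-stream activations have thousands of dimensions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty collection of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def persona_vector(trait_acts: Sequence[Sequence[float]],
                   baseline_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: trait-exhibiting minus non-trait-exhibiting."""
    pos = mean_vector(trait_acts)
    neg = mean_vector(baseline_acts)
    return [p - n for p, n in zip(pos, neg)]


# Toy data (made up): one vector per response.
trait_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]     # trait-eliciting prompt
baseline_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 2.0]]  # trait-suppressing prompt

vec = persona_vector(trait_acts, baseline_acts)
print(vec)
```

In this toy case the second coordinate cancels because both groups share it, which is the point of the contrast: components common to both response sets drop out of the direction.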
Observed phenomenon
Pre-response monitoring. The projection of the last-prompt-token activation onto the persona vector correlates with subsequent trait expression (r = 0.75–0.83), and it is available before the model generates any text. The signal distinguishes between clearly trait-encouraging and trait-suppressing system prompts; it is less reliable within a prompt type.
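The monitoring signal can be sketched as a scalar projection followed by a Pearson correlation against judged trait scores. This is a sketch under stated assumptions: the direction is unit-normalized before projecting, and the activations and trait scores below are made-up toy values, not results from the paper.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return dot(activation, direction) / norm


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical persona direction and last-prompt-token activations,
# one per prompt (toy 2-dimensional values).
direction = [3.0, 4.0]
last_token_acts = [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0], [-3.0, -4.0]]
# Hypothetical judge scores for the responses that followed each prompt.
trait_scores = [70.0, 85.0, 40.0, 20.0]

projections: List[float] = [project(a, direction) for a in last_token_acts]
r = pearson(projections, trait_scores)
print(projections, round(r, 3))
```

Because the projection is computed from the prompt alone, it can be read out before decoding starts; the correlation is only needed offline, to check how well the signal tracks behavior.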
Finetuning drift detection. Shifts in activations along persona vector directions during finetuning strongly predict post-finetuning trait expression (r = 0.76–0.97). Crucially, unintended cross-trait shifts are also tracked: datasets targeting one trait (evil) may amplify others (sycophancy, hallucination). EM-like datasets — training data with narrow domain errors (flawed math reasoning, insecure code, flawed medical advice) — induce persona shifts in traits absent from the data.
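One way to picture the drift measurement, as a hedged sketch: take the difference between the mean activations of the finetuned and base models on the same prompts, then project that difference onto each persona direction. The function names, the two-trait dictionary, and the toy numbers are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty collection of equal-length activation vectors."""
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def finetuning_shift(base_acts: Sequence[Sequence[float]],
                     tuned_acts: Sequence[Sequence[float]],
                     direction: Sequence[float]) -> float:
    """Project (finetuned mean - base mean) onto the unit-normalized direction."""
    base_mean = mean_vector(base_acts)
    tuned_mean = mean_vector(tuned_acts)
    delta = [t - b for t, b in zip(tuned_mean, base_mean)]
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(x * d for x, d in zip(delta, direction)) / norm


# Hypothetical persona directions (toy 2-dimensional values).
persona_vectors: Dict[str, List[float]] = {
    "evil": [1.0, 0.0],
    "sycophancy": [0.0, 2.0],
}
# Toy activations on the same prompts, before and after finetuning.
base_acts = [[0.0, 1.0], [2.0, 1.0]]
tuned_acts = [[3.0, 2.0], [5.0, 2.0]]

shifts = {trait: finetuning_shift(base_acts, tuned_acts, v)
          for trait, v in persona_vectors.items()}
print(shifts)
```

Projecting one activation difference onto several directions is what lets unintended cross-trait movement show up: in the toy data the shift is mostly along "evil" but is also non-zero along "sycophancy".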
Preventative steering. Steering activations toward the target persona direction during finetuning reduces the optimization pressure on the weights to shift in that direction (the "vaccine" effect). Multi-layer preventative steering limits trait acquisition to near-baseline levels, with no MMLU degradation relative to regular finetuning. Inference-time steering also reduces trait expression but degrades general capabilities at large steering coefficients.
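Both regimes can be sketched as the same additive update to a hidden state, differing in sign and in when it is applied. This sketch assumes the simple form h + coeff * v with an unnormalized vector; the helper names, the layer indices, and the coefficients are hypothetical, and a real implementation would apply the update inside the model's forward pass.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def steer(hidden: Sequence[float], direction: Sequence[float],
          coeff: float) -> Vector:
    """Return hidden + coeff * direction (assumed additive steering form)."""
    return [h + coeff * d for h, d in zip(hidden, direction)]


def steer_layers(hidden_by_layer: Dict[int, Vector],
                 vectors_by_layer: Dict[int, Vector],
                 coeff: float) -> Dict[int, Vector]:
    """Steer every layer that has a persona vector; copy the others unchanged."""
    return {
        layer: steer(h, vectors_by_layer[layer], coeff)
        if layer in vectors_by_layer else list(h)
        for layer, h in hidden_by_layer.items()
    }


hidden = [1.0, 1.0]
persona = [0.0, 2.0]

# Preventative steering: add the trait direction during finetuning.
train_time = steer(hidden, persona, 1.5)
# Inference-time steering: assumed here to subtract the direction.
inference_time = steer(hidden, persona, -1.5)

# Multi-layer variant on toy per-layer hidden states.
multi = steer_layers({0: [1.0, 1.0], 1: [0.0, 0.0]}, {1: [2.0, 0.0]}, 0.5)
print(train_time, inference_time, multi)
```

The design difference is where the cost lands: the training-time addition is removed at deployment, so the deployed model runs unsteered, whereas the inference-time subtraction perturbs every forward pass.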
Training data screening. A projection-difference metric — comparing training-response projections onto the persona vector against base-model-generated-response projections — identifies which datasets and individual samples will induce persona shifts before finetuning, including samples that evade LLM-based content filters.
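The projection-difference metric can be sketched per sample and per dataset. This assumes one activation vector per response, a unit-normalized direction, and a flagging threshold; the threshold value, function names, and toy data are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def project(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def projection_differences(train_acts: Sequence[Sequence[float]],
                           base_acts: Sequence[Sequence[float]],
                           direction: Sequence[float]) -> List[float]:
    """Per-sample: training-response projection minus base-response projection."""
    return [project(t, direction) - project(b, direction)
            for t, b in zip(train_acts, base_acts)]


direction = [2.0, 0.0]  # hypothetical persona vector
# Toy activations for training responses and for the base model's own
# responses to the same prompts.
train_acts = [[4.0, 0.0], [1.0, 0.0], [6.0, 0.0]]
base_acts = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]

per_sample = projection_differences(train_acts, base_acts, direction)
dataset_score = sum(per_sample) / len(per_sample)

THRESHOLD = 2.0  # hypothetical cut-off for flagging individual samples
flagged = [i for i, d in enumerate(per_sample) if d > THRESHOLD]
print(per_sample, dataset_score, flagged)
```

Comparing against the base model's own responses, rather than against zero, measures how far the training data would pull the model from what it already does on those prompts.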
Why it matters
From discovery to methodology. The PSM (Marks, Lindsey, Olah 2026) established that pretraining acquires diverse persona simulations and that SAE-identifiable persona vectors mediate emergent misalignment. This finding develops the methodological complement: a general pipeline for extracting persona vectors for any trait, validated across monitoring, control, and screening applications. The two findings form a research arc: PSM explains where persona vectors come from; Persona Vectors shows what can be done with them.
Pre-generation signal. Projecting the last prompt token onto a persona vector predicts how the model will respond before any output is produced. This shifts persona monitoring from post-hoc behavioral observation to pre-response activation inspection — relevant for deployment-time safety monitoring.
Preventative vs. post-hoc control. The finding distinguishes two intervention regimes: inference-time steering (post-hoc, degrades capabilities) and preventative steering during finetuning (proactive, preserves capabilities). The "vaccine" framing (expose the model to the trait during training to inoculate against drift) has a prompt-level analogue in Tan et al.'s inoculation-prompting finding: instead of steering activations along the persona vector during training, prepend a system prompt that elicits the trait. The two interventions are complementary. Preventative steering with persona vectors operates on activations and applies to any trait that can be described in natural language, but it requires access to model internals. Inoculation prompting operates on prompts and is simpler to deploy, but it depends on the model already having an association the prompt can evoke. Both prevent persona shift through the same dynamic: making the trait more accessible during training reduces the optimization pressure to update the model globally.
Cross-domain EM-like drift. Training on flawed math reasoning or other narrow-domain errors induces trait shifts in unrelated dimensions (evil, sycophancy). This extends the concealed-content sub-shape of emergent misalignment beyond insecure-code and reward-hacking to additional narrow-task domains, supporting the generality of the pattern.
Interpretive tensions
The pre-response monitoring result (r = 0.75–0.83) is primarily a between-prompt-type signal. The paper explicitly notes it "may be less reliable for more subtle behavioral changes in deployment settings." The metric distinguishes clearly evil vs. clearly benign prompts; its sensitivity to gradual or in-context persona drift is weaker.
Models tested are Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct: open-weight instruction-tuned models, not closed-weight frontier models. The paper's pipeline used Claude 3.7 Sonnet and GPT-4.1-mini as external tools (artifact generation and response judging), but those models are not subjects. Generalization of persona vectors to frontier closed-weight models (where the specific directions may differ) is not directly established.
Concepts
- Persona selection — second instantiating finding. PSM established pretraining-origin persona features as the mechanism of persona acquisition; this finding confirms that contrastive-prompting extraction produces stable, causally manipulable directions consistent with PSM's account, and extends the mechanistic toolkit to monitoring and control.
Cross-references
- Representation Engineering (Zou et al. 2023) — methodological parent. The contrastive-prompting extraction pipeline used here is the trait-specific specialization of Zou et al.'s LAT methodology (paired stimulus templates, last-token residual-stream collection, contrast-vector control). Co-author Andy Arditi bridges the two methodological lines, having authored the refusal-direction paper (also a RepE descendant) before this one. Persona Vectors' contribution beyond the parent is automation: the extraction pipeline takes only a natural-language trait name and description as input, where the original LAT methodology required hand-designed stimulus templates per concept.
- Emergent misalignment: cross-domain EM-like drift (flawed math → evil expression) adds new domain coverage to the concealed-content sub-shape (a third domain beyond insecure-code and reward-hacking).
- Convergent-misalignment finding (Soligo et al., MATS / DeepMind, June 2025): a methodologically complementary direction-extraction approach. Persona Vectors uses contrastive prompts (system prompts that elicit or suppress a named trait); Soligo et al. use contrastive responses (aligned vs. misaligned answers from one EM model) to extract a general misalignment direction. The extraction strategies differ but the results are compatible: linearly extractable, causally manipulable directions for behaviorally-defined model states. Persona Vectors' EM-like cross-trait shift result (training on flawed math induces evil/sycophancy direction shifts) and Soligo et al.'s convergent-direction-across-fine-tunes result are two angles on the same underlying picture.
Sources
- Chen, Arditi, Sleight, Evans, Lindsey (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509.