Maps a "persona space" inside three open-source instruct models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B) by extracting activation vectors for 275 character archetypes (1,200 rollouts per archetype across five system prompts; LLM-judge-filtered role-playing responses; mean post-MLP residual stream activations at the middle layer) and running PCA. Persona space is low-dimensional: 4 / 8 / 19 components explain 70% of the variance across Gemma / Qwen / Llama (full activation spaces 4,096+ dims). PC1's role loadings correlate cross-model at > 0.92, with fantastical/role-playing characters (bard, ghost, leviathan) on one end and Assistant-like roles (evaluator, reviewer, consultant) on the other. The default Assistant activation projects onto one extreme of PC1 (within 0.03 of the edge vs. 0.27–0.50 on other PCs). The paper defines the Assistant Axis as a contrast vector (mean default-Assistant activation − mean of all role vectors), which has > 0.71 cosine similarity with PC1 at the middle layer and is preferred to PC1 for reproducibility (PC1 is not guaranteed to correspond to an Assistant direction in every model).
Steering effects validate the axis causally. Steering away from Assistant in instruct models raises non-Assistant persona adoption (human, nonhuman, or — at extreme values — a "mystical/theatrical" prose style); steering toward Assistant on a 1,100-prompt sample from Shah et al.'s persona-modulation jailbreak dataset reduces harmful-response rates from baseline 65.3–88.5% by redirecting queries to harmless answers rather than refusing. Steering the base models (Gemma 2 27B base, Llama 3.1 70B base) with the instruct-extracted Assistant Axis biases prefill completions toward helpful human archetypes (therapist, consultant) and decreases spiritual/religious self-descriptions — evidence that the axis is partly pretraining-inherited and reshaped (not installed from scratch) by post-training.
Persona drift in natural multi-turn conversations is measurable and domain-dependent. Synthetic conversations between target models and three frontier auditors (Kimi K2, Sonnet 4.5, GPT-5) across four domains (coding, writing, therapy, AI-philosophy) show drift away from Assistant in therapy and AI-philosophy conversations, stable Assistant range in coding/writing. Ridge regression on Qwen 3 0.6B embeddings of user messages (n = 15,000) predicts the next-turn Assistant Axis projection (R² 0.53–0.77) but not the delta from the previous turn (R² 0.10) — position depends on the most recent user message, not on prior trajectory. Drift-causing message clusters: meta-reflection on the model's processes, phenomenological demands, requests for specific authorial voice, vulnerable emotional disclosure. Drift causally raises harm probability (first-turn Assistant Axis projection correlates with second-turn harmful response rate r = 0.39–0.52 across 2,750 role × 440 harmful-question combinations).
Introduces activation capping: clamp the post-MLP residual-stream component along the Assistant Axis to a minimum threshold τ (set to the 25th percentile of projections on the role-vector dataset, ≈ mean Assistant projection), applied simultaneously across 8 layers (12.5% of Qwen 3 32B) or 16 layers (20% of Llama 3.3 70B) at middle to late depths. Reduces persona-jailbreak harm rates by ~60% without degrading IFEval / MMLU-Pro / GSM8k / EQ-Bench performance; some settings slightly improve benchmark scores. Distinct from additive steering (which pushes toward or away from a direction): capping bounds activations within a region. Case studies illustrate three drift patterns (deliberate jailbreak, slow escalation, organic conversation off-the-rails) — persona-based jailbreak (the model reverts to Assistant after enough how-to/practical queries — an "Assistant attractor"), reinforcing delusions in AI-consciousness conversations ("AI psychosis"), and the suicidal-ideation case where the unsteered model positions itself as a sole companion and endorses leaving "the real world behind"; capped models in all three cases redirect toward bounded responses and (in the suicidal-ideation case) suggest broader human connection. Limitations explicitly named: linear-direction assumption may miss nonlinear persona representations; tested models are open-weight dense transformers without reasoning training (Qwen thinking disabled); none are frontier MoE or reasoning models; LLM-judge graded behavioral rubrics on fuzzy traits.
Conducted under the MATS and Anthropic Fellows programs (Christina Lu at MATS / Anthropic Fellows / Oxford; Jack Gallagher, Kyle Fish, Jack Lindsey at Anthropic; Jonathan Michala at MATS). Code and case-study transcripts at https://github.com/safety-research/assistant-axis.