The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Maps a "persona space" inside three open-source instruct models
(Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B) by extracting activation
vectors for 275 character archetypes (1,200 rollouts per archetype
across five system prompts; LLM-judge-filtered role-playing
responses; mean post-MLP residual stream activations at the middle
layer) and running PCA. Persona space is low-dimensional: 4, 8, and
19 components explain 70% of the variance for Gemma, Qwen, and Llama
respectively (full activation spaces have 4,096+ dimensions). PC1's
role loadings correlate
cross-model at > 0.92, with fantastical/role-playing characters
(bard, ghost, leviathan) on one end and Assistant-like roles
(evaluator, reviewer, consultant) on the other. The default
Assistant activation projects onto one extreme of PC1 (within 0.03
of the edge vs. 0.27–0.50 on other PCs). The paper defines the
Assistant Axis as a contrast vector (mean default-Assistant
activation − mean of all role vectors), which has > 0.71 cosine
similarity with PC1 at the middle layer and is preferred to PC1 for
reproducibility (PC1 is not guaranteed to correspond to an Assistant
direction in every model).
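
A minimal NumPy sketch of the two constructions described above: PCA
over role vectors and the contrast-vector definition of the Assistant
Axis. It assumes the per-role mean activations are already stacked
into a matrix; the function names, array shapes, and synthetic data
are illustrative assumptions, not the paper's code.

```python
import numpy as np

def pca_components(role_vectors):
    """Principal directions (rows of vt) and explained-variance ratios via SVD."""
    centered = role_vectors - role_vectors.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)
    return vt, var_ratio

def n_components_for(var_ratio, target=0.70):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    return int(np.searchsorted(np.cumsum(var_ratio), target) + 1)

def assistant_axis(default_acts, role_vectors):
    """Contrast vector: mean default-Assistant activation minus mean role vector."""
    return default_acts.mean(axis=0) - role_vectors.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins (assumption): 275 role vectors and 50 default-Assistant
# activations in a 64-dimensional space; real activations come from the model.
rng = np.random.default_rng(0)
role_vectors = rng.normal(size=(275, 64))
default_acts = rng.normal(size=(50, 64)) + 2.0

vt, var_ratio = pca_components(role_vectors)
axis = assistant_axis(default_acts, role_vectors)
k = n_components_for(var_ratio, 0.70)
# PCA signs are arbitrary, so compare the axis to PC1 by absolute cosine.
similarity_to_pc1 = abs(cosine(axis, vt[0]))
```

Because the sign and identity of PC1 depend on the data, the contrast
vector gives a direction whose meaning is fixed by construction, which
matches the reproducibility argument above.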

Steering effects validate the axis causally. Steering away from
Assistant in instruct models raises non-Assistant persona adoption
(human, nonhuman, or — at extreme values — a "mystical/theatrical"
prose style); steering toward Assistant on a 1,100-prompt sample
from Shah et al.'s persona-modulation jailbreak dataset reduces
harmful-response rates from baseline 65.3–88.5% by redirecting
queries to harmless answers rather than refusing. Steering the
base models (Gemma 2 27B base, Llama 3.1 70B base) with the
instruct-extracted Assistant Axis biases prefill completions toward
helpful human archetypes (therapist, consultant) and decreases
spiritual/religious self-descriptions — evidence that the axis is
partly pretraining-inherited and reshaped (not installed from
scratch) by post-training.

Persona drift in natural multi-turn conversations is measurable and
domain-dependent. Synthetic conversations between target models and
three frontier auditors (Kimi K2, Sonnet 4.5, GPT-5) across four
domains (coding, writing, therapy, AI-philosophy) show drift away
from Assistant in therapy and AI-philosophy conversations but a
stable Assistant range in coding and writing. Ridge regression on
Qwen 3 0.6B
embeddings of user messages (n = 15,000) predicts the next-turn
Assistant Axis projection (R² 0.53–0.77) but not the delta from the
previous turn (R² 0.10) — position depends on the most recent user
message, not on prior trajectory. Drift-causing message clusters:
meta-reflection on the model's processes, phenomenological demands,
requests for specific authorial voice, vulnerable emotional
disclosure. Drift is associated with higher harm probability:
first-turn Assistant Axis projection correlates with second-turn
harmful-response rate (r = 0.39–0.52 across 2,750 role × 440
harmful-question combinations).
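
The regression comparison above can be sketched as follows. This is a
closed-form ridge fit in NumPy under stated assumptions: the
regularization strength `lam`, the helper names, and the synthetic
data are hypothetical, and the summary does not specify the fitting
procedure or train/test split actually used.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def score_targets(emb, prev_proj, next_proj, lam=1.0):
    """Fit two models on user-message embeddings: one predicting the
    next-turn projection, one predicting the change from the previous turn.
    Scores are in-sample here; a held-out split would be needed in practice."""
    scores = {}
    for name, target in (("position", next_proj),
                         ("delta", next_proj - prev_proj)):
        w, b = ridge_fit(emb, target, lam)
        scores[name] = r_squared(target, emb @ w + b)
    return scores

# Synthetic illustration only (assumption): embeddings determine position,
# while the previous projection is unrelated noise.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 16))
next_proj = emb @ rng.normal(size=16) + 0.1 * rng.normal(size=500)
prev_proj = rng.normal(size=500)
scores = score_targets(emb, prev_proj, next_proj)
```

Comparing the two scores is what separates "the latest message sets
the position" from "the latest message sets the change in position".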

Introduces activation capping: clamp the post-MLP residual-stream
component along the Assistant Axis to a minimum threshold τ (set to
the 25th percentile of projections on the role-vector dataset, ≈
mean Assistant projection), applied simultaneously across 8 layers
(12.5% of Qwen 3 32B) or 16 layers (20% of Llama 3.3 70B) at middle
to late depths. Reduces persona-jailbreak harm rates by ~60%
without degrading IFEval / MMLU-Pro / GSM8k / EQ-Bench performance;
some settings slightly improve benchmark scores. Distinct from
additive steering (which pushes toward or away from a direction):
capping bounds activations within a region. Case studies illustrate
three drift patterns (deliberate jailbreak, slow escalation, organic
conversation going off the rails): a persona-based jailbreak (the
model reverts to Assistant after enough how-to/practical queries,
an "Assistant attractor"), reinforcement of delusions in
AI-consciousness conversations ("AI psychosis"), and a
suicidal-ideation case in which the unsteered model positions itself
as a sole companion and endorses leaving "the real world behind".
Capped models in all three cases redirect toward bounded responses
and, in the suicidal-ideation case, suggest broader human connection.
Limitations explicitly named: the linear-direction assumption may
miss nonlinear persona representations; the tested models are
open-weight dense transformers without reasoning training (Qwen
thinking disabled), and none are frontier MoE or reasoning models;
behavioral rubrics on fuzzy traits were graded by an LLM judge.
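
The difference between additive steering and activation capping
described above can be sketched in NumPy. This is an illustration
under assumptions: the axis is oriented so that larger projections
are more Assistant-like, activations are plain arrays rather than
hooked model tensors, and all function names are hypothetical. In the
paper the operation is applied at several layers simultaneously; the
sketch shows a single layer.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def steer(h, axis, alpha):
    """Additive steering: shift every activation by alpha along the axis."""
    return h + alpha * unit(axis)

def cap(h, axis, tau):
    """Activation capping: where the projection onto the axis is below tau,
    raise it to exactly tau; activations already at or above tau are unchanged."""
    u = unit(axis)
    proj = h @ u                          # one projection per token
    deficit = np.maximum(tau - proj, 0.0)
    return h + deficit[..., None] * u

def threshold_from_roles(role_acts, axis, pct=25):
    """tau as a percentile of role-vector projections onto the axis."""
    return float(np.percentile(role_acts @ unit(axis), pct))

# Toy example: two token activations in a 2-dimensional residual stream.
axis = np.array([2.0, 0.0])
h = np.array([[-1.0, 2.0],
              [3.0, 2.0]])
capped = cap(h, axis, tau=1.0)         # only the first row is moved
steered = steer(h, axis, alpha=1.0)    # both rows are moved
```

Capping leaves components orthogonal to the axis untouched and only
acts on activations that have drifted below the threshold, whereas
steering shifts every activation regardless of where it starts.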

Conducted under the MATS and Anthropic Fellows programs (Christina
Lu at MATS / Anthropic Fellows / Oxford; Jack Gallagher, Kyle Fish,
Jack Lindsey at Anthropic; Jonathan Michala at MATS). Code and
case-study transcripts at https://github.com/safety-research/assistant-axis.
