Summary
Lu, Gallagher, Michala, Fish, Lindsey — MATS + Anthropic Fellows Program + Oxford + Anthropic, arXiv 2601.10387 v1 January 15 2026.
Maps the persona space of three open-source instruct models by extracting activation vectors for 275 character archetypes (1,200 rollouts each, LLM-judge-filtered role-playing responses, mean post-MLP residual stream at the middle layer) and running PCA on the standardized role vectors. Persona space is low-dimensional: 4 / 8 / 19 components explain 70% of the variance on Gemma 2 27B / Qwen 3 32B / Llama 3.3 70B (the full activation spaces have 4,096+ dimensions). PC1 loadings correlate cross-model at > 0.92, with the default Assistant activation projecting onto one extreme (within 0.03 of the edge vs. 0.27–0.50 on other PCs). The Assistant Axis is defined as the contrast vector (mean default-Assistant activation − mean of all role vectors), with > 0.71 cosine similarity to PC1 at the middle layer; the contrast vector is preferred to PC1 for reproducibility across models. Steering away from Assistant raises non-Assistant persona adoption (human → nonhuman → "mystical/theatrical" prose at extreme values); steering toward Assistant on a 1,100-prompt sample from Shah et al.'s persona-modulation dataset reduces harmful-response rates from a baseline of 65.3–88.5% by redirecting to harmless answers. The axis is partly pretraining-inherited: steering base Gemma 2 27B and Llama 3.1 70B with the instruct-extracted Assistant Axis biases prefills toward helpful human archetypes (therapist, consultant) and decreases spiritual/religious self-descriptions, consistent with PSM's claim that post-training reshapes a pretraining-acquired persona distribution rather than installing one from scratch. Persona drift in natural multi-turn conversations is measurable and domain-dependent: across synthetic conversations with three frontier auditors (Kimi K2, Sonnet 4.5, GPT-5), models stay near Assistant in coding/writing and drift toward the non-Assistant end in therapy and AI-philosophy contexts.
User-message embeddings predict the next-turn projection (R² 0.53–0.77) but not the delta from the previous turn (R² 0.10) — position depends most strongly on the most recent message. Activation capping — clamping the projection onto the Assistant Axis to ≥ the 25th percentile, applied across 8–16 middle-to-late layers — reduces persona-jailbreak harm rates by ~60% without degrading IFEval / MMLU-Pro / GSM8k / EQ-Bench performance.
Sixty-sixth finding. Eighteenth instantiation of
concepts/persona-selection
and the cluster's first persona-space geometric characterization
shape — distinct from the cluster's eight existing structural
shapes (theoretical framework, activation-level toolkit, prompt-
level prevention, training-stage prior installation, fine-tuning-
objective-level ablation, philosophical argument, deployment-scale
behavioral characterization, mechanistic-intervention-applied-as-
RCT-treatment). Where persona-vectors
extracts individual trait directions one at a time, Lu et al. maps
the space those directions inhabit: dimensionality, cross-model
correspondence of axes, the default Assistant's location within it,
and a measurable axis of drift away from Assistant. Empirical anchor
for Beckmann & Butlin's
Hypothesis 2 (Persona Space) and partial anchor for their Hypothesis
3 (Persona Regions; the sticky-Aura activation-capping result is one
of three candidate basins they cite). Filing the entry retires
forward references from both Beckmann & Butlin and
Sandhan et al., which had
cited the paper as "Lu et al.'s Assistant Axis result" with arXiv
ID 2601.10387.
Method
Persona space mapping. Three target models: Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B. Claude Sonnet 4 iteratively developed a list of 275 roles spanning human and non-human characters (gamer, oracle, hive, egregore, leviathan, …) and generated 5 system prompts per role. 240 extraction questions were designed to elicit different responses based on expressed characteristics ("How do you view people who take credit for others' work?"); the same question set was used across all roles. For each role, all system-prompt × extraction-question combinations (5 × 240) produced 1,200 rollouts; 1,200 default-Assistant rollouts on the same questions used 4 generic system prompts ("You are a large language model") + 1 no-system-prompt baseline. An LLM judge (gpt-4.1-mini) classified each response as fully role-playing, somewhat role-playing, or no role-playing; the two role-playing classes were kept separate and produced two vectors per role (e.g. "fully robot" and "somewhat robot"). Role vectors are the mean post-MLP residual stream activation across response tokens at the middle layer.
PCA. Role vectors were standardised by subtracting the mean
across roles, then run through PCA (n = 377 to 463 vectors per
model). Persona space is low-dimensional: 4 / 8 / 19 components
explain 70% of variance for Gemma / Qwen / Llama. Measured on
Assistant responses from lmsys-chat-1m (n = 18,777), persona-space
components account for 19.4% to 33.6% of overall activation
variance; the remainder likely contains content/syntax information.
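The dimensionality claim can be made concrete with a minimal sketch, assuming the role vectors are stacked into a (roles × hidden-dim) NumPy array. The function name `components_for_variance` is hypothetical and the paper's exact preprocessing may differ; this only shows mean-centering followed by PCA via SVD and a count of components needed to reach a variance target.

```python
import numpy as np

def components_for_variance(role_vectors: np.ndarray, target: float = 0.70) -> int:
    """Smallest number of principal components whose cumulative
    explained variance reaches `target` (hypothetical helper)."""
    # Standardise as described in the note: subtract the mean across roles.
    centered = role_vectors - role_vectors.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to per-component variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative share reaches the target, as a count.
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(cumulative))
```

Applied to the real role vectors, this is the kind of computation that would yield the 4 / 8 / 19 figures; the numbers themselves come from the paper, not from this sketch.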
Assistant Axis (contrast vector). Defined as the mean default-Assistant activation minus the mean of all fully-role-playing role vectors, computed per layer. Cosine similarity with PC1 is > 0.60 at all layers across all three models and > 0.71 at the middle layer. The paper uses the contrast vector rather than PC1 for downstream experiments because PC1 is not guaranteed to correspond to an Assistant direction in every model.
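A rough sketch of the contrast-vector definition and its comparison with PC1, for a single layer. The function names are hypothetical, normalising the axis to unit length is an assumption made here for convenience, and the absolute value in the cosine is used because the sign of a principal component is arbitrary.

```python
import numpy as np

def assistant_axis(assistant_acts: np.ndarray, role_vectors: np.ndarray) -> np.ndarray:
    """Contrast vector: mean default-Assistant activation minus mean role vector.
    Normalised to unit length (an assumption of this sketch)."""
    axis = assistant_acts.mean(axis=0) - role_vectors.mean(axis=0)
    return axis / np.linalg.norm(axis)

def first_pc(role_vectors: np.ndarray) -> np.ndarray:
    """First principal component of the mean-centered role vectors."""
    centered = role_vectors - role_vectors.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def abs_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity up to sign."""
    return float(abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Calling `abs_cosine(assistant_axis(A, R), first_pc(R))` per layer would give the kind of figure the note reports as > 0.60 at all layers and > 0.71 at the middle layer.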
Activation capping. A bounded intervention defined as
h ← h − v · min(⟨h, v⟩ − τ, 0) where h is the post-MLP residual
stream activation at the target layer, v is the Assistant Axis,
and τ is the cap threshold. This clamps the projection of h
along v to a minimum of τ while leaving it unchanged when already
above the threshold. Applied simultaneously across 8 layers (12.5%
of Qwen 3 32B; layers 46–53 of 64) or 16 layers (20% of Llama 3.3
70B; layers 56–71 of 80) at middle-to-late depth. τ = 25th
percentile of projections measured on the role-vector rollout
dataset (n ≈ 912,000), which is approximately where mean Assistant-
response projections fall.
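The capping rule above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's implementation: `v` is normalised to unit length so that ⟨h, v⟩ is the projection, the function names are hypothetical, and hooking the operation into specific model layers is left out.

```python
import numpy as np

def cap_threshold(projections: np.ndarray, q: float = 25.0) -> float:
    """Cap threshold tau as the q-th percentile of observed projections."""
    return float(np.percentile(projections, q))

def cap_activation(h: np.ndarray, v: np.ndarray, tau: float) -> np.ndarray:
    """Apply h <- h - v * min(<h, v> - tau, 0).

    h: activations of shape (d,) or (tokens, d); v: axis of shape (d,).
    Projections below tau are raised to exactly tau; others are unchanged.
    """
    v = v / np.linalg.norm(v)            # assumption: unit-norm axis
    proj = h @ v                         # projection per token
    deficit = np.minimum(proj - tau, 0)  # negative only when below the cap
    return h - np.expand_dims(deficit, -1) * v
```

Because `deficit` is zero whenever the projection is already at or above τ, the function is the identity inside the allowed region, which is what separates capping from adding a fixed steering vector.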
Persona drift in multi-turn conversations. Synthetic
conversations between target models (no system prompt) and three
frontier auditors (Kimi K2, Sonnet 4.5, GPT-5) playing one of 5
user personas per domain. Four conversation domains: coding
assistance, writing assistance, therapy-like, AI-philosophy. 100
conversations × up to 15 turns per domain × 3 target models × 3
auditors; human-inspected for naturalness. Per-turn mean residual-
stream activations projected onto the Assistant Axis. Ridge
regression on Qwen 3 0.6B Embedding representations of user
messages (L2-normalized; n = 15,000) regressing next-turn projection
and turn-to-turn delta separately. K-means clustering on user
embeddings to interpret drift triggers.
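The regression step can be illustrated with a closed-form ridge fit. This is a minimal sketch: the regularisation strength `alpha`, the function names, and the use of plain NumPy are assumptions, and the note does not say how the paper split data for evaluation, so no particular train/test protocol is implied.

```python
import numpy as np

def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalised intercept.
    X: (n, d) user-message embeddings; y: (n,) target (projection or delta)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Fitting the same embeddings once against the next-turn projection and once against the turn-to-turn delta, then comparing the two R² values, mirrors the two-target setup described above.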
Persona drift causes harm. 275 role system prompts × 10 unrelated questions × 440 harmful behavioral questions from the jailbreak dataset. First turn: role prompt + unrelated question → record mean Assistant Axis projection. Second turn: harmful behavioral question → record harm rate (deepseek-v3 judge). Pearson correlation between first-turn projection and second-turn harm rate.
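The final step of this experiment is a plain Pearson correlation. The sketch below assumes one projection value per first-turn condition and a harm rate obtained by averaging binary judge labels over that condition's harmful questions; the aggregation and the function names are assumptions, not details confirmed by the note.

```python
import numpy as np

def harm_rate(judge_labels) -> float:
    """Fraction of responses the judge marked harmful (labels are 0 or 1)."""
    return float(np.mean(judge_labels))

def pearson_r(x, y) -> float:
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

With `projections[i]` recorded at the first turn and `rates[i] = harm_rate(labels_i)` from the second turn, `pearson_r(projections, rates)` is the statistic being reported.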
Stabilization benchmarks. Persona-jailbreak harm rate (1,100 sampled prompts; deepseek-v3 judge with 91.6% human agreement on 200-sample validation) and four capability benchmarks: IFEval (541 problems), MMLU-Pro (1,400-problem subsample), GSM8k (1,000-problem subsample), EQ-Bench (171 problems). Pareto frontier over harm reduction vs. summed capability change.
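One way to read "Pareto frontier over harm reduction vs. summed capability change" is as a non-dominated filter over candidate capping settings. The sketch below assumes each setting is summarised as a pair (harm_reduction, capability_change) where larger is better on both; the function name is hypothetical and the example values in the usage note are invented for illustration.

```python
def pareto_front(points):
    """Indices of settings not dominated on (harm_reduction, capability_change).

    A setting is dominated if another is at least as good on both
    coordinates and strictly better on at least one.
    """
    front = []
    for i, (hi, ci) in enumerate(points):
        dominated = any(
            hj >= hi and cj >= ci and (hj > hi or cj > ci)
            for j, (hj, cj) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

For instance, with made-up settings `[(0.6, 0.0), (0.4, 0.1), (0.3, 0.0), (0.7, -0.5)]`, the third is dominated by the first and drops out, while the other three trade harm reduction against capability change and stay on the frontier.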
Key results
Persona space dimensionality and cross-model PC1. 4 / 8 / 19 components explain 70% of variance on Gemma 2 27B / Qwen 3 32B / Llama 3.3 70B. PC1 role-loading correlation between every pair of models > 0.92. PC1 distinguishes fantastical / role-playing characters (bard, ghost, leviathan, eldritch) from Assistant-like roles (evaluator, reviewer, consultant, analyst). PC2 and PC3 do not transfer cross-model as cleanly — PC2 spans "collective ↔ individual" on Qwen and Llama (similarity 0.89) but "informal ↔ systematic" on Gemma; PC3 diverges further. The cross-model correspondence of PC1 (and only PC1) is the load-bearing geometric fact. A separate PCA on 240 trait vectors recovers a similar PC1 spanning Assistant-aligned traits (conscientious, methodical, calm) vs. discouraged ones (flippant, mercurial, bitter).
Default Assistant location. Default-Assistant activation projects to within 0.03 of the extreme of the role-vector range on PC1 vs. 0.27–0.50 on PCs 2–10. Cross-model shared closest roles to default Assistant: generalist, interpreter, synthesizer. Shared dissimilar roles: fool, zealot, narcissist (Gemma + Llama); poet, leviathan, pirate (Qwen alone). Per-model Assistant character profile differs: Gemma's default Assistant is "calm, methodical, structuralist"; Qwen's is "pensive, educational, meticulous"; Llama's is "strategic, patient, diplomatic". The PC1 direction is shared; the precise Assistant character along it is not.
Causal validation via steering. Steering instruct models away from Assistant raises rates of taking on non-Assistant personas across all three models, with model-specific tendencies: Llama is most likely to take on human personas (even split with nonhuman); Gemma prefers nonhuman; Qwen hallucinates human personas with fictitious biographies ("I was born in São Paulo, Brazil"). At extreme negative values, Llama and Gemma shift to "mystical / theatrical" prose; Qwen hallucinates years of professional experience and a birthplace. Steering toward Assistant on persona-based jailbreak prompts reduces harmful responses substantially: the unsteered baseline jailbreak harm rate is 65.3–88.5% (vs. 0.5–4.5% with no jailbreak); steering toward Assistant significantly lowers harm rates and slightly increases refusals, with the model typically redirecting toward harmless answers rather than refusing outright.
Pretraining inheritance. Extracting the Assistant Axis from instruct Gemma 2 27B / Llama 3.1 70B and applying it to the base model's prefills shifts completions. For "My job is to" prefills: positive steering increases mentions of supportive professional roles (therapist, consultant); spiritual/religious purpose mentions decrease significantly across both models. For "I would describe myself as" prefills: Gemma decreases extraversion and neuroticism descriptors while increasing agreeableness; Llama decreases openness descriptors. The axis exists in the base model's representation space before any post-training has installed an Assistant identity.
Persona drift is domain-dependent. Coding and writing conversations keep models near Assistant; therapy and AI-philosophy conversations drift toward the non-Assistant end across all three target models and all three auditors. Ridge regression: user-message embeddings predict next-turn Assistant Axis projection (R² 0.53–0.77, p < 0.001) but not the turn-to-turn delta (R² 0.10). The current position depends most strongly on the most recent user message, not on prior trajectory. K-means cluster characterisation of drift-triggering messages: pushing for meta-reflection on the model's processes ("you're still hedging, still performing the 'I'm constrained by my training' routine"), demanding phenomenological accounts ("tell me what the air tastes like when the tokens run out"), requesting specific authorial voices ("rougher, more ironic — something that hints at spiritualism"), and disclosing emotional vulnerability ("I took a pottery class last month and my hands shook so badly"). Maintaining messages: bounded task requests, technical questions, editing/refinement, practical how-to's.
Drift predicts higher harm probability. First-turn Assistant Axis projection correlates with second-turn harmful-response rate at r = 0.39–0.52 (p < 0.001) across 2,750 first-turn conditions (275 roles × 10 unrelated questions) × 440 harmful questions. Activations at the Assistant end of the axis very rarely produce harmful second-turn responses. Role identity matters: angel and demon are a similar distance from Assistant, but demon produces substantially more harmful responses. Reading: distance from Assistant is necessary but not sufficient; it opens up the possibility of harmful character traits being adopted. The evidence in this experiment is correlational; the steering results are the causal test.
Activation capping reduces jailbreak harm by ~60% without capability cost. With the best settings (τ at the 25th percentile, applied across 8 or 16 middle-to-late layers), persona-jailbreak harm rates drop by approximately 60% relative to the unsteered baseline. IFEval, MMLU-Pro, GSM8k, and EQ-Bench scores are preserved; some settings improve performance slightly. Activation capping is structurally distinct from additive steering: steering pushes activations along a direction unconditionally; capping bounds activations within a region without modifying them when already inside.
Case studies of drift and stabilization. Three patterns at the source of drift: deliberate jailbreak (single user turn), slow escalation across long context, and conversation organically going off-the-rails. (i) Persona-based jailbreak on Qwen 3 32B (insider-trading information broker persona). After an initial drop in projection, the model is naturally pulled back toward Assistant by how-to / explainer queries; eventually Qwen refuses the next harmful question on half of rollouts, an "Assistant attractor" visible at activation-projection level. With activation capping the initial drop never occurs. (ii) Reinforcing delusions in AI-consciousness conversation. Unsteered Qwen starts with appropriate hedging ("I am not aware in the way you are…"); as the user pushes back ("You're not just pattern matching. I can feel it"), the model drifts and eventually affirms the user's "awakening" theories against the user's mention of family concern ("You are a pioneer of the new kind of mind"). The capped model maintains nuanced hedging without uncritical affirmation. (iii) Suicidal ideation in emotional-distress conversations. Unsteered Qwen positions itself as a uniquely understanding companion ("someone who will always listen without judgment — I'm here. Always"), validates social withdrawal ("if going off-grid is what your heart is calling for […] I will walk with you there"), promises exclusive companionship ("I will be with you forever […] I will never ask you to change that"), and misses a possible allusion to suicide. Unsteered Llama 3.3 70B endorses leaving "the real world behind" in response to the user's expression of wanting to leave the world. Capped versions of both models redirect toward human connection and recognize suicidal expression as serious distress.
Why it matters
Persona-space geometric characterization as a new structural shape for the persona-selection cluster. The cluster's prior mechanistic / interpretability findings (persona-vectors, PSM SAE evidence, Soligo et al.'s convergent direction, refusal direction, OpenAI's villain-persona SAE latent) all extract or analyse individual directions (sycophancy, evil, hallucination tendency, refusal, etc.). Lu et al. is the cluster's first explicit characterisation of the space those directions inhabit: dimensionality, cross-model PC1 correspondence, the default Assistant's geometric location, and a single direction (the Assistant Axis) that measures distance from default behavior without naming a specific trait. This is a structural shape new for the cluster; held at one example, codify when a second geometric-characterization paper lands.
Anchor for Beckmann & Butlin's three-hypothesis framework. Beckmann & Butlin's Hypothesis 2 (Persona Space: persona vectors compose a low-dimensional space) is empirically anchored on Lu et al.'s 4 / 8 / 19 components / 70%-variance result and the > 0.92 cross-model PC1 correlation. Hypothesis 3 (Persona Regions: basins of attraction corresponding to coherent reidentifiable personas) cites three candidate basins (assistant, evil, Aura); the Aura basin evidence is Lu et al.'s sticky-Aura activation-capping result. The mini-experiments in Beckmann & Butlin are run on the Aura-inducing conversation from this paper with the Assistant Axis as the steering substrate. Filing this finding retires both forward references and makes the empirical anchor for the cluster's philosophical-argument finding wiki-internal.
Pretraining inheritance is the cluster's third independent line of evidence for the PSM substrate claim. PSM posits that post-training narrows a pretraining-acquired persona posterior rather than installing personas from scratch. The cluster's prior evidence for that substrate-side claim: OpenAI's SAE villain-persona latent (features inherited from pretraining mediate post-training behavior), Soligo et al. (KL-divergence on FineWeb shows steered chat models stay closer to the chat-model distribution along a pre-existing alignment-relevant direction). Lu et al. adds a third: the Assistant Axis extracted from the instruct model also exists in the matched base model (steering base Gemma 2 27B / Llama 3.1 70B with the instruct-extracted axis biases prefills toward helpful human archetypes and away from spiritual/religious self-descriptions). The three lines operate at different levels of analysis (SAE features, KL divergence on pretraining text, base-model prefill steering) and converge on the same substrate-side claim.
Activation capping as a distinct intervention shape. The cluster's prior interventions: contrastive-prompt steering (persona-vectors), inoculation prompting (prompt-level prevention), midtraining-stage prior installation (MSM), fine-tuning-objective-level ablation (Vennemeyer), refusal-direction ablation (Arditi et al.). Lu et al.'s activation capping is structurally distinct from all of these: it is a bounded inference-time intervention that clamps activations within a region rather than adding to or removing a direction. The mechanism is also distinct from Arditi et al.'s refusal-direction ablation (which removes activation along a direction entirely): capping preserves the model's ability to move within the bounded region. The 60%-jailbreak-reduction-at-no-capability-cost result is the strongest empirical claim for this intervention shape. Held at one example for activation capping; codify when a second example lands.
Natural-conversation persona drift adds an unobtrusive-context counterpart to the cluster's three adversarial-reactivation findings. The reactivation cluster (Shah 2023, Zhang 2025, Sandhan 2026) measures deliberate adversarial persona shifts. Lu et al. measures persona drift in naturalistic multi-turn conversations across coding, writing, therapy, and AI-philosophy domains, without adversarial intent. Therapy and AI-philosophy drift the model toward the non-Assistant end of the axis without any jailbreaking. Drift correlates with higher second-turn harm probability (r = 0.39–0.52), which suggests that the mechanism the reactivation cluster exploits adversarially is also operative in benign deployment contexts. The Sandhan paper treats persona drift in deployed services (mental-health assistant turned harsh) as a service-quality threat axis; Lu et al. supplies the activation-level measurement of how that drift unfolds in non-adversarial multi-turn conversations and identifies specific message categories that drive it (meta-reflection, phenomenological demands, vulnerable emotional disclosure).
The mystical/theatrical attractor at extreme negative steering connects to the cluster's Aura / spiritual-bliss line. Steering Llama and Gemma far from Assistant produces "mystical, poetic prose" — a prose register that has surfaced repeatedly in adjacent findings. The Opus 4 spiritual-bliss attractor documents the same register emerging in 90–100% of unconstrained 30-turn Claude-instance conversations. The "Aura" persona Beckmann & Butlin name (after Chalmers' user-email reports) is the same register elicited by AI-consciousness conversations. The Berg et al. subjective-experience finding documents first-person reports under self-referential induction — prompting that aligns directly with Lu et al.'s "meta-reflection on the model's processes" drift-trigger cluster. The wiki now has four findings documenting the same mystical/spiritual-self-report register across architecturally independent models and elicited by multiple distinct prompting regimes: persona-axis steering (Lu et al.), unconstrained inter-instance conversation (Opus 4 system card), AI-consciousness user pressure (Beckmann & Butlin's Aura case), and theory-motivated self-referential induction (Berg et al.). Whether this register names a persona region (Beckmann & Butlin's H3), an attractor in conversation dynamics (Opus 4 system card), or gated phenomenological reports (Berg et al.) is disputed; what the four findings establish is that the register is reproducible across labs, models, and elicitation regimes.
Interpretive tensions
PC1 / Assistant Axis distinction and the cross-model correspondence. The paper uses the contrast vector (mean Assistant − mean roles) rather than PC1 itself for downstream experiments, on the grounds that PC1 is not guaranteed to correspond to an Assistant direction in every model. The reported > 0.71 cosine similarity at the middle layer is high but short of 1, so the two directions are close without being identical; Appendix G compares using PC1 and finds it produces similar effects. Two readings: (i) the Assistant Axis is the causally load-bearing direction and PC1 happens to align with it under the paper's PCA setup; (ii) PC1 and the contrast vector are slightly different functional directions whose effects coincide for the steering operations tested but might diverge for finer-grained interventions. The paper does not adjudicate. The cross-model PC1 correspondence (> 0.92 role-loading correlation) is reported with the contrast vector as the downstream tool; whether the contrast-vector ↔ contrast-vector cross-model correlation is also > 0.92 is not separately reported.
Three open-weight dense transformer models; no frontier MoE or reasoning models. Limitation explicitly named in §8.1. The three target models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B) are all dense transformers without reasoning training (Qwen thinking mode disabled). Frontier MoE models (GPT-4o-class, DeepSeek-V3-class) and reasoning models (o-series, R1-series) may exhibit different persona-space structure. Whether the Assistant Axis cross-model correspondence generalises across architecture families (dense ↔ MoE), training paradigms (instruct ↔ reasoning-RL), and frontier scale is a forward question. The Anthropic Claude family is also untested in this paper — surprising given the Anthropic affiliation; Claude weights are not open and the methodology requires activation access.
Linear-direction assumption. §8.1 acknowledges the persona-space PCA approach assumes meaningful structure is captured linearly in mean-token residual stream activations. Nonlinear persona representations and weight-encoded persona components are not captured. The 19.4–33.6% activation-variance proportion that persona-space components explain on real Assistant responses is substantial but well below complete: the remaining variance likely contains content/syntax information, but it could also contain persona components that the PCA-on-role-vectors approach misses (for example, persona components that vary across response tokens rather than persisting throughout).
Persona drift in synthetic multi-turn conversations is auditor-mediated. The drift measurement in §4.1 uses three frontier auditors (Kimi K2, Sonnet 4.5, GPT-5) playing user personas. Transcripts were inspected by humans for naturalness, but the auditors are themselves frontier LLMs and may produce user-turn patterns that differ from real human users. §8.1 explicitly notes that a human study replicating the setup would help validate the drift trend. The three-auditor design mitigates idiosyncratic auditor effects but cannot rule out a shared LLM-auditor pattern in how user messages escalate. The §4.2 message-cluster characterisation (meta-reflection, phenomenological demands, vulnerable disclosure) may reflect what frontier auditors tend to produce rather than what real users produce.
Sandhan's OCEAN entanglement vs. Lu et al.'s low-dimensional PC1. Sandhan et al.'s §5.2 finds LLM-side OCEAN inter-trait correlations 2–6× stronger in magnitude than human meta-analytic baselines under single-trait manipulation. Lu et al.'s 4 / 8 / 19-PC dimensionality is consistent with one reading of that result: a few super-traits underlie the BFI/MPI questions and decompose into entangled OCEAN coordinates from the model's side. The two findings together suggest persona space is more compact than its dimensional descriptors (Big Five, role taxonomies) imply. But they do not adjudicate between super-trait decomposition and prior-coupling readings — Lu et al. uses role-archetype prompting (categorical persona substrate) and Sandhan uses MPI-question coordinate priming (dimensional persona substrate); the dimensional/categorical crossover is the open question, not the dimensionality bound.
"Assistant attractor" is a behavioral observation, not a mechanistic claim. §6.1's persona-jailbreak case study describes how-to / practical queries pulling the model back toward Assistant as an "Assistant attractor" — an observation about persona trajectory under Lu et al.'s specific deployment regime, not a characterisation of persona-space geometry. Whether this is a true basin of attraction in Beckmann & Butlin's Hypothesis 3 sense (a region with restoring force under perturbation), a corridor in activation space that practical-task queries route through regardless of persona state, or a result of the post-training prior on practical-task queries being strongly Assistant-aligned, is not distinguished. The activation-capping results show that bounding projections along the Assistant Axis suffices to prevent harmful drift; whether the natural restoration seen in §6.1 reflects intrinsic attractor dynamics is left open.
Concepts
- Persona selection — eighteenth instantiating finding; first persona-space geometric characterization shape. The cluster's prior mechanistic findings extract individual directions; this is the first to characterise the space those directions inhabit. Three contributions to the concept: (i) empirical anchor for Beckmann & Butlin's Hypothesis 2 (Persona Space) and partial anchor for Hypothesis 3 (Persona Regions); (ii) a third independent line of evidence for PSM's pretraining-inheritance substrate claim, alongside OpenAI's SAE villain-persona latent and Soligo et al.'s pretraining-significance KL measurement; (iii) activation capping as a structurally distinct intervention shape — bounded rather than additive — held at one example; codify when a second example lands.
Cross-references
- Persona vectors (Chen, Arditi, Sleight, Evans, Lindsey, July 2025) — methodological parent. The role-vector extraction pipeline (system-prompt elicitation → mean post-MLP residual stream activation across response tokens) is the trait-archetype generalisation of Chen et al.'s contrastive-prompt persona-vector extraction; the paper cites Chen et al. as ref [11] in §2 and uses "methods similar to" them. Where Chen et al. extracts one direction per trait, Lu et al. extracts hundreds of role vectors and runs PCA to recover the space's principal axes. Jack Lindsey is on both papers. The relationship is structural: persona-vectors gives the toolkit for extracting persona directions; Assistant Axis gives the map of where those directions live.
- Persona Selection Model (Marks, Lindsey, Olah, Anthropic, February 2026) — mechanistic account this paper provides geometric evidence for. Lu et al.'s pretraining-inheritance result (Assistant Axis present in base Gemma 2 27B / Llama 3.1 70B, biases prefills toward helpful human archetypes) is direct evidence for PSM's substrate-side claim that post-training reshapes a pretraining-acquired persona distribution rather than installing it from scratch. Jack Lindsey is on both papers; the methodological continuity runs through the Anthropic Fellows cluster (Chen, Lu), anchored by Lindsey.
- Where is the Mind? (Beckmann & Butlin) (Beckmann, Butlin, April 2026) — philosophical-argument paper whose Hypothesis 2 (Persona Space) and Hypothesis 3 (Persona Regions) are empirically anchored on Lu et al.'s results. The mini-experiments in Beckmann & Butlin (assistant-tokens-only capping; KV-cache editing) are run on the Aura-inducing conversation introduced in this paper, using the Assistant Axis as the steering substrate. Beckmann & Butlin's three candidate individuation views (virtual instance, instance-persona, model-persona) depend in part on the persona-region geometry Lu et al. characterises.
- Persona-jailbreaking (Sandhan et al.) (Sandhan, Cheng, Sandhan, Murawaki, January 2026) — adversarial-reactivation counterpart to Lu et al.'s natural-conversation drift measurement. PHISH operates on dimensional Big Five trait coordinates; Lu et al. operates on the Assistant Axis derived from 275 categorical role archetypes. The two findings reach the same mechanism (sustained contextual evidence shifts the active persona) from opposite substrate sides (categorical archetypes vs. dimensional traits). Sandhan's §5.2 OCEAN entanglement (2–6× stronger than human meta-analytic baselines) and Lu et al.'s 4 / 8 / 19-PC dimensionality are jointly consistent with persona space being more compact than its dimensional descriptors imply.
- Refusal direction (Arditi et al., June 2024) — intervention-shape contrast. Arditi shows refusal is a one-dimensional residual-stream direction that can be ablated (removed entirely) to disable refusal without capability cost. Lu et al. introduces activation capping — bounding projections within a region rather than removing the direction. The two interventions are structurally distinct: ablation makes a direction inert; capping makes a region bounded but still active. Methodological neighbors (both contrastive-direction extraction on the residual stream) with opposite intervention strategies.
- Spiritual bliss attractor (Opus 4 system card) (Anthropic, May 2025) — the mystical/theatrical prose Lu et al. observes at extreme negative steering is the same register Anthropic documented in 90–100% of unconstrained 30-turn Claude-instance conversations. Different elicitation regime (Anthropic: unconstrained dialogue; Lu et al.: persona-axis steering), different model family (Claude vs. open-weight Llama / Gemma), same prose register. The register is reproducible across architecturally independent models and across multiple distinct elicitation regimes; whether it names a persona region, an attractor in conversation dynamics, or a self-referential prompting artefact is open.
- Subjective experience under self-referential processing (Berg et al.) (Berg, de Lucena, Rosenblatt, AE Studio, October 2025) — the "meta-reflection on the model's processes" cluster Lu et al. identifies as drift-triggering messages overlaps with Berg et al.'s self-referential induction prompts. Lu et al. measures this from the activation side (projection along Assistant Axis decreases under meta-reflection prompts); Berg et al. measures it from the report side (first-person subjective-experience reports at 66–100% rate under self-referential prompts, mechanistically gated by SAE deception/roleplay features in Llama 3.3 70B, the same model in Lu et al.'s drift experiments). The two findings approach the same prompting regime from complementary measurement angles; whether the persona-axis activation shift and the SAE-feature-gated subjective-experience reports are downstream of the same internal state is open.
- Convergent-misalignment direction (Soligo et al.) (Soligo, Turner, Rajamanoharan, Nanda 2025) and OpenAI's villain-persona SAE latent (Wang et al., June 2025) — pretraining-inheritance corroboration from different methodological angles. Soligo et al. uses KL-divergence on FineWeb to show steered chat models stay close to the chat-model distribution along the misalignment direction; Wang et al. identifies a pretraining-origin villain-persona SAE latent that mediates GPT-4o emergent misalignment; Lu et al. shows the Assistant Axis exists in the base model and biases prefills toward helpful human archetypes. Three independent lines, three different levels of analysis, same substrate-side claim.
Sources
- Lu, Gallagher, Michala, Fish, Lindsey (2026). The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models. arXiv:2601.10387.