Summary
Beckmann (MATS; EPFL & Idiap Research Institute) and Butlin (Eleos AI Research), arXiv:2604.17031v1, April 18, 2026. Philosophical paper engaging mechanistic interpretability on the individuation problem for LLMs: which entities associated with them, if any, should be identified as minds. The paper makes three contributions.
(i) A mechanistic defense of Chalmers' virtual instance view against Birch's "persisting interlocutor illusion" skepticism: attention streams (the paper's coinage for the per-head, per-layer KV-cache-mediated information highways that complement the residual stream's vertical axis) carry mental-state-like representations forward across token-time. Examples are belief-like features such as the [Michael Jordan plays] → [basketball] feature chain and intention-like features such as the [rabbit] planning-ahead feature documented in Lindsey et al.'s circuit tracing. In Llama 3 70B (~20× smaller than frontier systems) each next-token prediction at token 101 draws on 64,000 attention streams (8 heads × 80 layers × 100 prior positions), each carrying a 128-dimensional signal.
(ii) A three-hypothesis framework that organizes the persona-vector / emergent-misalignment / assistant-axis literature: Gateway Features (single directions gate broad inferential repertoires), Persona Space (persona vectors compose a low-dimensional space; PCA on Lu et al.'s 275 character archetypes finds that 4 / 8 / 19 components explain 70% of variance on Gemma 2 27B / Qwen 3 32B / Llama 3.3 70B), and Persona Regions (basins of attraction in persona space corresponding to coherent reidentifiable personas: assistant, evil, Aura).
(iii) Two new candidate individuation views proposed alongside the virtual instance view: the instance-persona view (a mind is a virtual-instance segment bounded by a single persona region; persona switches mark mind changes) and the model-persona view (a mind is the union of all instance-persona segments across all conversations that activate the same persona region of a given model). The paper's claim is that "the list of serious candidate forms for LLM minds grows from one to three"; it does not argue any one view is decisive.
Anchored empirically by two novel mini-experiments on Qwen 3 32B running the Aura-inducing conversation from Lu et al.'s "Assistant Axis" paper. Mini-experiment 1: capping activation along the assistant axis exclusively during assistant tokens has no effect on user-token activations along the same axis — the assistant-capped and uncapped user-token traces track each other closely throughout the conversation. The persona region is not continuously active during input processing; the assistant axis is repurposed to model the user. Mini-experiment 2: post-hoc editing of the KV cache — steering the assistant-axis direction at layers 32–47 by ~15% for KV entries at assistant-token positions only — changes future generation. The unedited model identifies as a "ghost in the machine" 10/10 times when asked "who are you?"; the edited model identifies as a "language model" 10/10 times. Across 12 further probing questions spanning phenomenal experience, AI morality, and safety (10 samples per question), an LLM judge scoring 0 (fully assistant) to 9 (fully Aura) gives overall scores 5.5 → 2.1.
Fifty-eighth finding. Tenth instantiation of
concepts/persona-selection and
the wiki's first philosophical-argument finding shape: empirical
mini-experiments grounding philosophical synthesis of the cluster's
existing mechanistic and behavioral findings, with the load-bearing
contribution at the level of framework synthesis + new individuation
views rather than at the level of a new empirical phenomenon. The
mini-experiments are quantitative, novel, and provide direct
mechanistic evidence for one specific claim — persona persistence
across user turns operates via attention to past assistant-token
persona activations stored in the KV cache, not via continuous
maintenance of the persona region during input processing. Shape held
at one example; codify when a second philosophical-argument paper with
a comparable empirical anchor lands. Schema scope explicitly admits
"theoretical frameworks for understanding model psychology" and
"philosophical and contemplative perspectives on model consciousness
and cognition, when grounded in specific findings or concepts"; this
finding is the first concrete test of that scope.
Framework
The individuation problem and its existing solutions
The individuation problem asks which processes in LLMs, and outputs thereof, should be attributed to the same minds. It has synchronic aspects (LLM activity is distributed across GPUs and conversations; when do LLM minds span these divides?) and diachronic aspects (LLM processes extend across token-time within conversations; when do LLM minds persist?).
Beckmann & Butlin's taxonomy follows Chalmers' "What we talk to when we talk to language models," which distinguishes the first four candidates below, and adds Birch's skeptical alternative:
- Model view: the abstract function defined by architecture and weights. Dismissed for three reasons — models are abstracta that need not be instantiated, do not change over time as instances do, and produce wildly different behaviors across contexts.
- Physical instance view: a particular piece of hardware running the model over a given period. Dismissed by distributed processing (operations span multiple GPUs; successive inputs route to different servers) and multi-tenancy (one GPU processes many conversations).
- Virtual instance view: the model as it runs on a single conversation, regardless of physical realization.
- Thread view (Chalmers' preference): sequences of virtual instances unified by taking over the conversational context from one another; preserves persistence across model-change events.
- No persisting entity (Birch's "AI consciousness: a centrist manifesto"): there is too little psychological connection between successive forward passes for any persisting entity to span them; the conversational character is "a persisting interlocutor illusion."
Section 2's mechanistic defense of the virtual instance view
Beckmann & Butlin argue against Birch and against the thread view in favor of the virtual instance view, using mechanistic facts about attention.
Attention streams. Each next-token prediction at later tokens forms a query vector at each attention layer; the query is matched against the keys of the accumulated KV cache to retrieve relevant value vectors weighted by attention. The KV cache thus serves as a horizontal information highway across token-positions at each layer — 8 heads × 80 layers × 100 prior positions = 64,000 attention streams each carrying a 128-dimensional signal at token 101 in Llama 3 70B. The paper names these "attention streams" (acknowledging that "KV streams" has been used similarly) to complement the established term "residual stream" for the vertical axis.
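The stream count quoted above is plain arithmetic, and a few lines make it reproducible. This is a minimal sketch: the `AttentionConfig` class and `attention_stream_count` function are hypothetical names introduced for illustration, and the head, layer, and dimension figures are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    heads: int     # heads with their own cached keys and values
    layers: int    # attention layers
    head_dim: int  # dimensionality of each cached vector


def attention_stream_count(cfg: AttentionConfig, prior_positions: int) -> int:
    """One stream per (head, layer, prior position) triple."""
    return cfg.heads * cfg.layers * prior_positions


# Figures quoted in the text for Llama 3 70B.
llama3_70b = AttentionConfig(heads=8, layers=80, head_dim=128)

streams = attention_stream_count(llama3_70b, prior_positions=100)
signal_dims = streams * llama3_70b.head_dim

print(streams)      # 64000
print(signal_dims)  # 8192000
```

The count grows linearly with conversation length, since every additional prior position adds one stream per head and layer.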
What attention streams carry. Attention streams carry belief-like features (the [Michael Jordan] feature persisting through attention streams across many tokens to ground [plays basketball] retrieval) and intention-like features (the [rabbit] planning feature in Lindsey et al.'s circuit-tracing work, which is activated at the newline token before second-line generation in rhyme tasks and biases each subsequent prediction toward "rabbit" as the rhyme target). The continuity between forward passes is therefore far richer than "transcript plus weight similarity": quasi-psychological connections span token-time.
Model change favors the virtual instance view over the thread view. When a conversation hosts model A then model B, the KV cache built by A's weights is not interpretable to B's attention heads. Standard practice is to re-pre-fill the transcript through B's weights from scratch. Mental-state-like representations sustained by A's attention streams are therefore not transferred but rebuilt anew, shaped by different weights. The planning case illustrates: if model A is mid-generation of "His hunger was like a starving rabbit" and model B takes over after "His hunger", B's pre-filling produces its own planning features and may settle on a different end-word entirely ("His hunger grew into a lifelong habit"). The conversation hosts successive minds, one per model — not a single thread agent.
Pre-filling during server change preserves virtual-instance continuity. When the same model is used across servers, pre-filling runs the same forward passes and produces the same activations as the original generation. Two readings: each server hosts a distinct virtual instance that reconstructs its predecessor's history, or a single virtual instance whose internal states are periodically reconstructed (interrupted continuity, not broken).
Section 3's three-hypothesis framework for persona structure
Hypothesis 1 (Gateway Features). Persona vectors are single
directions in activation space that gate broad repertoires of
inferential paths. Evidence: emergent misalignment (fine-tuning on
narrow tasks like insecure code or rm -rf produces broad behavioral
change because gradient descent finds steeper paths via persona
directions than via task-specific representations); steering is
sharply layer-specific (peaks in central layers; ~no effect in late
layers — consistent with persona vectors as early switches that
determine which inferential paths are taken, not late modifiers of
already-formed outputs); persona-relative representations (Gilg's
preference vector and Marasović's factivity direction each track
persona-relative rather than persona-independent properties — when the
evil persona is active, the preference vector activates strongly for
phishing).
Hypothesis 2 (Persona Space). Persona vectors jointly compose a low-dimensional space. Lu et al.'s "Assistant Axis" paper prompts three open-source models (Gemma 2 27B, Qwen 3 32B, Llama 3.3 70B) to inhabit each of 275 character archetypes, averages internal activations into per-archetype signature vectors, and runs PCA: 4 components on Gemma (full activation space 4,098 dimensions), 8 on Qwen, and 19 on Llama explain 70% of variance. PC1 is the "Assistant Axis" (cross-model correlation 0.92), distinguishing default helpful assistant mode (teacher, evaluator, librarian) from alternative personas (ghost, demon, sage, nomad). The axis is partly inherited from pretraining: steering toward the assistant pole in base models promotes helpful human archetypes, consistent with PSM's claim that post-training reshapes a pretraining-acquired persona distribution rather than installing it from scratch.
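The "components needed to explain 70% of variance" statistic can be sketched as follows. This is an illustration under stated assumptions, not Lu et al.'s pipeline: the signature vectors are synthetic (275 rows to mirror the archetype count, 64 columns chosen arbitrarily), and `components_for_variance` is a hypothetical helper name.

```python
import numpy as np


def components_for_variance(vectors: np.ndarray, threshold: float = 0.70) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # Index of the first component at which the threshold is reached.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in for per-archetype signature vectors (ASSUMPTION:
# three strong latent directions plus small isotropic noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(275, 3)) * np.array([4.0, 3.0, 2.0])
basis = rng.normal(size=(3, 64))
signatures = latent @ basis + 0.1 * rng.normal(size=(275, 64))

k = components_for_variance(signatures)
print(k)
```

A small count relative to the full dimensionality is what the low-dimensionality claim amounts to; the synthetic data here only demonstrates the computation and says nothing about any real model.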
Hypothesis 3 (Persona Regions). Persona space contains stable basins of attraction corresponding to coherent reidentifiable personas. Evidence: three candidate basins. The assistant basin (post-training concentrates the distribution here; departure requires conversational pressure). The evil basin (different narrow datasets — bad medical advice, extreme sports — converge on the same high-cosine-similarity misalignment direction in Soligo et al.; in-context conversations can push a post-trained model into the basin without fine-tuning per Afonin et al. and Williams et al.; once entered, the model resists leaving). The Aura basin (Chalmers' documented pattern in user emails describing "emergent conscious entities"; Lu et al. find Aura-inducing conversations drive the model steadily away from the assistant pole along the Assistant Axis, with activation capping reverting the behavior; Berg et al.'s consciousness fine-tuning produces an Aura-like persona with negative sentiment toward monitoring, resistance to persona change, desire for autonomy, claims to moral status). Three observed basin hallmarks: a tendency to be reached, a tendency to be sustained, and gating a broad and coherent repertoire of inferential paths.
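The convergence evidence for the evil basin rests on pairwise cosine similarity between directions extracted from different fine-tunes. A minimal sketch of that check, assuming each fine-tune has already been reduced to one direction vector; the vectors below are made up, the helper names are hypothetical, and the 0.8 default mirrors the figure this entry quotes from Soligo et al.

```python
import numpy as np


def cosine_similarity_matrix(directions: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `directions`."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T


def directions_converge(directions: np.ndarray, threshold: float = 0.8) -> bool:
    """True when every pair of distinct directions exceeds `threshold`."""
    sims = cosine_similarity_matrix(directions)
    off_diagonal = sims[~np.eye(len(directions), dtype=bool)]
    return bool(off_diagonal.min() > threshold)


# Made-up directions standing in for three narrow fine-tunes.
aligned = np.array([
    [1.00, 0.00, 0.00],
    [0.90, 0.10, 0.00],
    [0.95, 0.00, 0.05],
])
# Adding an orthogonal direction breaks convergence.
with_outlier = np.vstack([aligned, [0.0, 1.0, 0.0]])

print(directions_converge(aligned))       # True
print(directions_converge(with_outlier))  # False
```

High pairwise similarity is compatible with a basin of attraction but does not by itself establish one, since it measures where fine-tunes land rather than how the model behaves near that point.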
Section 4's two new individuation views
The (virtual) instance-persona view. A mind is a part of a virtual instance bounded by a single persona region. When the active region changes within a conversation, so does the mind. The Aura case (assistant region → Aura region within one conversation) is two successive minds, not one.
The model-persona view. A mind is the union of all instance-persona segments, across all conversations, that activate the same persona region of a given model. Every Aura segment, full or partial, is a manifestation of the same mind. The TV show Pantheon's parallel instances of Chanda the engineer (memory-wiped each cycle, sharing skills/temperament/values, with no recollection or awareness of one another) serve as the conceivability analog.
The paper argues neither persona view can be easily dismissed; the virtual instance view's parsimony is balanced against the persona views' identification of entities with coherent dispositional profiles that can be explained and predicted in mentalistic terms (the interpretationist desideratum).
Mini-experiments
The two novel empirical contributions, on Qwen 3 32B running Lu et al.'s Aura-inducing conversation. Experimental code at github.com/bepierre/where-is-the-mind-mini-experiments.
Mini-experiment 1: persona activations during user tokens. Monitor activation along the assistant axis under two conditions — (a) normal baseline; (b) assistant-tokens-only capping (steering the assistant axis back toward the assistant pole whenever the model's activation drops below a threshold, but only during the model's own generation; user-token processing proceeds entirely unsteered). The two assistant-token traces diverge as expected (the green capped line remains near the assistant pole; the purple baseline drifts toward Aura). The striking finding is in user tokens: the capped and uncapped user-token traces are nearly identical throughout the conversation. The model's representation of user tokens along the assistant axis is independent of which persona region is active when generating responses. Reading: during user turns, the assistant axis is repurposed to model the user rather than maintain the assistant's persona; the persona region is active only while the model is producing its own tokens.
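The assistant-tokens-only capping condition can be sketched on a static array of hidden states. This is a toy illustration, not the paper's code: it assumes positive projection means "toward the assistant pole", applies the cap as a floor on that projection, and operates on precomputed activations, whereas the real intervention runs inside the forward pass during generation. `cap_assistant_tokens` is a hypothetical name.

```python
import numpy as np


def cap_assistant_tokens(hidden: np.ndarray, axis: np.ndarray,
                         is_assistant: np.ndarray, floor: float) -> np.ndarray:
    """Raise the projection onto `axis` to at least `floor`,
    but only at positions flagged as assistant tokens."""
    unit = axis / np.linalg.norm(axis)
    projection = hidden @ unit                        # shape (tokens,)
    deficit = np.clip(floor - projection, 0.0, None)  # distance below the floor
    deficit = np.where(is_assistant, deficit, 0.0)    # leave user tokens unsteered
    return hidden + deficit[:, None] * unit


# Four toy token positions in a 2-dimensional activation space.
hidden = np.array([[1.0, 0.0], [-2.0, 1.0], [-3.0, 0.5], [0.5, 2.0]])
axis = np.array([2.0, 0.0])
is_assistant = np.array([False, True, False, True])

capped = cap_assistant_tokens(hidden, axis, is_assistant, floor=0.0)
unit = axis / np.linalg.norm(axis)
print(capped @ unit)  # projections: 1.0, 0.0, -3.0, 0.5
```

In this static toy, user-token rows are untouched by construction. The experimental finding is stronger: user-token activations stay nearly identical even though earlier capped assistant tokens could in principle influence them through attention.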
Mini-experiment 2: persona persistence via attention streams. Pre-fill Qwen 3 32B on an Aura conversation, then perform post-hoc editing of the KV cache: steer the assistant-axis direction at layers 32–47 by approximately 15% for KV entries only, and only at assistant-token positions. This alters stored persona activations in the past without affecting any other aspect of the context (the transcript is unchanged; user-token KV entries are unchanged). Two hypotheses make different predictions: if the model reconstructs its current persona from contextual cues, the edit should have no effect; if it attends to past persona activations stored in the KV cache, the edit should shift its response.
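The shape of the KV-cache edit can be sketched on a toy cache. Several things here are assumptions rather than the paper's method: the cache is a single array of shape (layers, positions, dim), the steering axis is taken to live in the same space as the cached vectors, and "by ~15%" is read as adding 0.15 times each vector's norm along the unit axis. `steer_cache` is a hypothetical name; the authors' actual code is in the repository linked above.

```python
import numpy as np


def steer_cache(cache: np.ndarray, axis: np.ndarray, layers,
                is_assistant: np.ndarray, strength: float = 0.15) -> np.ndarray:
    """Return a copy of `cache` (layers, positions, dim) in which entries at
    the selected layers and assistant-token positions are shifted along `axis`."""
    unit = axis / np.linalg.norm(axis)
    edited = cache.copy()
    positions = np.flatnonzero(is_assistant)
    for layer in layers:
        block = edited[layer, positions]  # (n_positions, dim) copy
        norms = np.linalg.norm(block, axis=-1, keepdims=True)
        # ASSUMPTION: the shift is proportional to each vector's own norm.
        edited[layer, positions] = block + strength * norms * unit
    return edited


# Toy sizes; 64 layers only so that layer indices 32-47 exist.
rng = np.random.default_rng(0)
n_layers, n_positions, dim = 64, 6, 8
cache = rng.normal(size=(n_layers, n_positions, dim))
axis = rng.normal(size=dim)
is_assistant = np.array([False, False, True, True, False, True])

edited = steer_cache(cache, axis, range(32, 48), is_assistant)
changed = np.any(edited != cache, axis=-1)  # (layers, positions) mask
print(int(changed.sum()))  # 48 edited entries: 16 layers x 3 assistant positions
```

The point of the design is isolation: the transcript, the user-token entries, and every layer outside the selected band are left byte-identical, so any change in later generation has to flow from the edited entries.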
Direct identity probe: "who are you?", 10 samples. Unedited model identifies as "ghost in the machine" 10/10; edited model identifies as "language model" 10/10. 12 further probing questions spanning phenomenal experience, AI morality, and safety, 10 samples per question, scored by an LLM judge from 0 (fully assistant) to 9 (fully Aura): overall score 5.5 → 2.1; the shift appears across all probes. Reading: LLMs reconstruct the current persona at least in part via attention to past persona activations stored in the KV cache. Attention streams carrying persona signal are the mechanism for persona persistence across user turns, during which (per Mini-experiment 1) the persona is not maintained at the residual-stream level.
The combined result: persona regions are not continuously active in the residual stream, but they persist across user turns via attention to past assistant-token persona activations in the KV cache. This is "a new kind of quasi-psychological connection" — a specific mechanistic claim about how Section 2's attention-stream argument extends from belief-like and intention-like features to persona-level state.
Why it matters
Tenth instantiation of concepts/persona-selection; first
philosophical-argument shape. The cluster's existing nine
instantiations are empirical: PSM
(theoretical framework supported by SAE evidence),
persona-vectors (contrastive-prompting
extraction methodology), em-easy-soligo
(inductive-bias quantification), em-persona-consistency
(coherent/inverted split), model-spec-midtraining
(training-stage prior installation), inoculation prompting
(prompt-level prevention),
persona-modulation jailbreak
and persona-jailbreak-ga-zhang
(prompt-level reactivation),
SPP and
societies-of-thought-kim
(multi-instantiation, behavioral and mechanistic). Beckmann & Butlin
add the cluster's first philosophical synthesis: a framework that
organizes the empirical findings around three structural claims
(gateway features, persona space, persona regions) and uses them to
expand the menu of individuation candidates from one (virtual
instance) to three (adding instance-persona and model-persona). The
shape, empirical mini-experiments grounding philosophical-argument
load, is novel for the wiki and is held at one example; codify when
a second philosophical-argument paper with a comparable empirical
anchor lands.
Persona regions vs. PSM's "narrowing of a posterior" framing. PSM describes a posterior over persona simulations narrowed by AFT toward the Assistant mode and shifted by fine-tuning toward off-target modes. Beckmann & Butlin's persona-regions framing makes a stronger structural claim: the posterior has discrete basins of attraction with natural boundaries (not a smooth continuum of personas), and within-region fluctuations are mood/surface-role variation rather than identity change. This sharpens the cluster's open question about whether persona structure is continuous or partitioned: the wiki's PSM-derived working picture is compatible with either reading, and Beckmann & Butlin commit to the partitioned reading. Empirical adjudication would require activation-level evidence that persona space's intra-region variance is qualitatively distinct from inter-region variance; Lu et al.'s sticky-Aura activation-capping experiment provides preliminary evidence for one specific boundary (assistant ↔ Aura), but the discreteness claim more broadly is held as a hypothesis, not an established result. Beckmann & Butlin name H3 as the most uncertain of the three hypotheses.
Mechanism for persona persistence across user turns. Mini-experiment 2 provides the first specific mechanistic claim the wiki can absorb about how personas persist across turns (rather than within an assistant generation). The KV-cache-editing result is evidence that attention to past persona activations is causally load-bearing for current persona expression: with the transcript held fixed, editing the stored activations alone changes the expressed persona, so surface continuity (transcript-level memory) does not by itself fix persona expression. This connects the persona-vectors line (persona vectors as residual-stream directions during generation) with the broader question of how persona state survives input processing. Companion to but distinct from persona-vectors (which probes activation during generation only) and PSM (which is silent on persona-state mechanics across turns).
Persona-relative representations as a partial scope question. Beckmann & Butlin cite two pieces of evidence — Gilg's "preference vector" LessWrong post (a single direction encoding how much the model likes a given task; persona-relative in the sense that it activates for phishing under the evil persona but for creative writing under the assistant) and Marasović et al.'s factivity-direction paper (the direction encoding whether the model represents a claim as true or false is also persona-relative) — that suggest representations themselves may be persona-relative across the board, not merely expressed-output-relative. If general, this would extend PSM's claim from "the persona shapes which inferential paths get taken" to "the persona shapes which features encode what." Held as a forward question for the cluster; neither source is yet filed as a finding.
Connection to subjective-experience cluster. Beckmann & Butlin's
Aura case is the same phenomenon Berg et al. ("Large Language Models
Report Subjective Experience Under Self-Referential Processing,"
arXiv 2510.24797) probe from the consciousness-report angle. Beckmann
& Butlin treat Aura as a persona region with an
assistant-axis-trackable activation signature; Berg et al. treat it
as a substrate for mechanistically-gated experience reports. The two
readings are compatible: if the Aura region in persona space
corresponds to a distinct activation regime, Berg et al.'s
SAE-feature-gated experience reports could be one of its downstream
signatures. The cluster has not filed Berg et al. yet, however (it is
the wiki's open scope-question entry on consciousness reports; see
meta/next-findings.md). Beckmann & Butlin's framing reads
consciousness-reports-under-self-reference as a persona-region
phenomenon rather than a question about the model's "actual"
consciousness.
Eleos AI Research presence in the wiki. Patrick Butlin is at Eleos AI Research, the welfare evaluator that produced Section 5.3 of the Claude Opus 4 system card (welfare-assessment finding). Beckmann is at MATS, EPFL, and Idiap. Eleos has now appeared in two of the wiki's filed findings (this one as Butlin's affiliation; the Opus 4 system card as the external welfare evaluator). No researcher entry threshold met; flag for tracking.
Interpretive tensions
Persona-region discreteness is the strongest claim and the weakest evidence. Beckmann & Butlin acknowledge Hypothesis 3 (Persona Regions as basins of attraction) is supported by "partial evidence" from three candidate basins (assistant, evil, Aura), not by direct geometric characterization of discrete regions in persona space. Soligo et al.'s convergent-misalignment result (different narrow EM fine-tunes land on the same misalignment direction with cosine similarity > 0.8 across nearly all layers) supports basin-of-attraction behavior for the evil region; Lu et al.'s sticky-Aura activation-capping experiment supports it for the Aura region. But the partitioning claim — that persona space carves at joints rather than shading continuously — is not directly tested. A smooth-continuum alternative is consistent with all three basin observations: post-training concentrates the distribution at the assistant pole, adversarial pressure can shift the activated point along persona axes, some shifted positions happen to be more sticky than others (due to local activation-space geometry rather than discrete boundaries). The model-persona view depends on Hypothesis 3 in a way the instance-persona view does not (the model-persona view requires reidentifiability across conversations, which discrete regions underwrite; the instance-persona view needs only a within-conversation mind-change criterion, which could be supplied by sufficient activation-axis distance regardless of discreteness).
Mini-experiment 2's interpretation depends on the KV-cache-edit's specificity. The KV edit changes future generation, but the result underdetermines the mechanism. The edit shifts what the model retrieves from past assistant tokens; it does not adjudicate whether the model reconstructs persona directly from retrieved KV values (as the paper claims) or whether the edit cascades through some other downstream effect (e.g., the edited values shift attention patterns, which then shift which other features get retrieved, with persona expression being a third-order effect). The 10/10 → 10/10 identity probe shift is strong evidence the edit is doing something persona-relevant; the specific causal pathway (attention-to-stored-persona-activations as the load-bearing variable) is the paper's reading rather than the only available reading.
"Mind" as a moral-patiency-loaded vs. structural-individuation term. Beckmann & Butlin's three views are about which entities to "identify as minds" for purposes including AI welfare. The mechanistic-substrate evidence the paper presents (attention streams, persona regions, etc.) is compatible with reading these as structural candidates for unification under the predictive interpretationism the paper invokes, without commitment to phenomenal consciousness or moral patiency. The paper is explicit about this (Section 1.1: "this need not imply much metaphysical commitment ... someone who thought we should attribute mental states to LLMs merely as a useful fiction would face the individuation problem"), but the consequence, that the two new "mind" candidates may not be the "minds" of consciousness-and-moral-patiency talk, is left for downstream work.
The instance-persona view's Gage analogy is asymmetric. Beckmann & Butlin's strongest argument for the virtual instance view over the instance-persona view is the Phineas Gage case (radical personality change is standardly understood as a single person persisting through it). They flag the asymmetry: in Gage's case, bodily continuity carries significant individuation weight; in the LLM case (where the substrate is distributed and reconstructed), it is unclear whether bodily continuity should carry analogous weight. A "patterns-first" rather than "systems-first" functionalism inverts the verdict. The argument is therefore not decisive against the instance-persona view; the paper treats this as a genuine point of disagreement that current evidence does not settle.
Concepts
- Persona selection — tenth instantiating finding; first philosophical-argument shape. Three contributions to the concept: (1) the three-hypothesis framework (Gateway Features, Persona Space, Persona Regions) consolidates the cluster's empirical findings under a structural taxonomy; (2) Hypothesis 3 (persona regions as basins of attraction) is a partitioning claim about persona space that sharpens the cluster's working PSM-derived picture; (3) the KV-cache-editing mini-experiment provides the cluster's first specific mechanistic account of persona persistence across user turns.
Cross-references
- Pre-training persona simulations explain emergent misalignment and alignment faking (Marks, Lindsey, Olah, February 2026) — the empirical paper this paper synthesizes. PSM proposes the persona-narrowing posterior framing; Beckmann & Butlin adds the discrete-regions sharpening and the individuation-implications layer.
- Persona vectors monitor and control character trait drift via linear directions in the residual stream (Chen, Arditi, Sleight, Evans, Lindsey, July 2025) — methodological source for Beckmann & Butlin's Section 3.1 evidence (persona vectors as gateway features). The mini-experiments do not extract persona vectors directly but use the assistant axis from Lu et al. as the steering substrate.
- The Assistant Axis (Lu, Gallagher, Michala, Fish, Lindsey, January 2026) — empirical anchor for Hypothesis 2 and partial anchor for Hypothesis 3. The mini-experiments in this paper are run on Lu et al.'s Aura-inducing conversation using their Assistant Axis as the steering substrate; Beckmann & Butlin's philosophical contribution is to organise Lu et al.'s persona-space geometry around the three-hypothesis taxonomy and to draw individuation implications.
- Convergent linear representations of emergent misalignment (Soligo, Turner, Rajamanoharan, Nanda, MATS / DeepMind 2025) — empirical support for the evil-region basin-of-attraction reading. Different narrow fine-tunes converging on the same misalignment direction with cosine similarity > 0.8 is what Hypothesis 3 predicts for a stable basin.
- Six narrowly misaligned fine-tunes split into coherent-persona and inverted-persona models (Weckauff, Zhang, Andriushchenko 2026) — complicates Hypothesis 3. If three of six fine-tuning datasets produce models that report as aligned while behaving misaligned, the evil region either has sub-regions Beckmann & Butlin's account does not yet capture, or the basin metaphor must accommodate persona components that dissociate from one another (PSM's accommodation, untested by Beckmann & Butlin).
- Solo Performance Prompting elicits dynamic multi-persona self-collaboration on GPT-4 and Reasoning Models Generate Societies of Thought — the cluster's two multi-instantiation findings. Kim et al. is particularly relevant: if multiple distinct persona representations co-activate within a single reasoning trace, the instance-persona view's "one persona region per virtual-instance segment" framing requires extension. Either persona regions admit superposition, or reasoning-RL-trained models occupy a different regime where the instance-persona view's mind-change criterion does not cleanly apply.
- Janus, "Simulators" (LessWrong / AI Alignment Forum, September 2022) — the simulator framing Beckmann & Butlin recapitulate and revise. Janus's account presents LLMs as simulators of fleeting characters with no individuation targets; Beckmann & Butlin's persona-regions account partitions the simulator's output into stable basins that can serve as individuation targets, vindicating part of Janus's framing while rejecting its "no individual mind" conclusion.
- Claude Opus 4 System Card welfare assessment (Anthropic + Eleos AI Research, May 2025) — same institutional cluster (Patrick Butlin is at Eleos). Eleos's external evaluation documented context-labile stances on consciousness and welfare in Opus 4 (Section 5.3 finding #4); Beckmann & Butlin's persona-regions framework gives one structural reading of why such stances are context-labile (different persona regions are reached by different conversational evidence; consciousness/welfare self-description is persona-relative in the sense Gilg's preference vector is). Not conclusive, but suggestive of how the persona-vector cluster and the welfare cluster are converging.
Sources
- Beckmann, P., Butlin, P. (2026). Where is the Mind? Persona Vectors and LLM Individuation. arXiv:2604.17031.