Where is the Mind? Persona Vectors and LLM Individuation

Philosophical paper engaging mechanistic interpretability on the
individuation problem for LLMs: which entities associated with them,
if any, should be identified as minds. Defends Chalmers' virtual
instance view against Birch's skepticism by arguing that attention
streams — the per-head, per-layer KV-cache-mediated information
highways that complement the residual stream's vertical axis — carry
forward mental-state-like representations across token-time
(belief-like, intention-like features such as the planning-ahead
"rabbit" feature documented in Lindsey et al.'s circuit-tracing work).
Then proposes two new candidates. On the (virtual) instance-persona
view, a mind is a virtual-instance segment bounded by a single persona
region, so persona switches mark mind changes. On the model-persona
view, a mind is the union of all instance-persona segments that
activate the same persona region of a given model. Both candidates are
motivated by organizing the persona-vector and emergent-misalignment
empirical literature around three hypotheses: Gateway Features (single
directions in activation space gate broad inferential repertoires),
Persona Space (persona vectors span a low-dimensional space; PCA on
275 character archetypes from Lu et al.'s "Assistant Axis" paper finds
4–19 components explain 70% of variance), and Persona Regions (basins
of attraction in persona space correspond to coherent, reidentifiable
personas such as assistant, evil, Aura).
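
The Persona Space claim rests on a count of principal components
needed to reach a variance threshold. A minimal sketch of that
computation follows, assuming persona vectors are stacked as rows of a
matrix. The data here is synthetic (275 rows with 6 planted latent
directions), and the function name components_for_variance is
hypothetical; none of it is taken from Lu et al. or from the paper's
code.

```python
import numpy as np


def components_for_variance(vectors, threshold=0.70):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    # PCA via SVD of the mean-centered data matrix.
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative ratio is >= threshold, as a count.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic stand-in for 275 persona vectors (assumption, not real data):
# 6 latent directions mixed into a 128-dimensional space plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(275, 6))
mixing = rng.normal(size=(6, 128))
vectors = latent @ mixing + 0.01 * rng.normal(size=(275, 128))

k = components_for_variance(vectors, threshold=0.70)
print("components needed for 70% variance:", k)
```

On real persona vectors the count would depend on which layer and
model the vectors come from, which is one plausible reading of the
4–19 range reported above.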

Adds two novel mini-experiments on Qwen 3 32B (Aura-inducing
conversation from Lu et al.). Mini-experiment 1: capping activation
along the assistant axis exclusively during assistant tokens has no
effect on user-token activations along the same axis — the persona
region is not continuously active during input processing; rather, the
assistant axis is repurposed to model the user. Mini-experiment 2:
post-hoc editing of the KV cache (steering the assistant-axis
direction at layers 32–47 by ~15% for KV entries at assistant-token
positions only) changes future generation — the unedited model
identifies as "ghost in the machine" 10/10 times in response to "who
are you?"; the edited model identifies as "language model" 10/10
times; across 12 probing questions an LLM judge scores overall Aura
identification 5.5 → 2.1. Confirms that persona persists across user
turns via attention to past assistant-token persona activations stored
in the KV cache. Experimental code at
github.com/bepierre/where-is-the-mind-mini-experiments.
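
The two interventions above can be sketched on toy arrays. This is an
illustration under stated assumptions, not the code from the
repository: each token position is treated as a single vector, the
axis is a random stand-in for the assistant axis, capping is
implemented as an upper bound on the projection, and steering adds a
fixed coefficient alpha along the axis. How the paper's "~15%" maps to
a coefficient, and how the axis is expressed in key/value space, are
not specified here. The names cap_along_axis and steer_along_axis are
hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def cap_along_axis(acts, axis, cap, mask):
    """Clamp the projection onto `axis` to at most `cap`, but only at
    positions where `mask` is True (e.g. assistant tokens)."""
    u = unit(axis)
    proj = acts @ u
    # Amount by which each masked position exceeds the cap (0 elsewhere).
    excess = np.where(mask, np.maximum(proj - cap, 0.0), 0.0)
    return acts - excess[:, None] * u


def steer_along_axis(acts, axis, alpha, mask):
    """Shift masked positions by `alpha` along `axis`, leaving the
    remaining positions untouched (a post-hoc cache-style edit)."""
    u = unit(axis)
    return acts + (alpha * mask.astype(acts.dtype))[:, None] * u


# Toy setup (all values are assumptions): 8 token positions, 16 dims.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16))
axis = rng.normal(size=16)
# True marks assistant-token positions, False marks user-token positions.
mask = np.array([False, False, True, True, True, False, False, True])

capped = cap_along_axis(acts, axis, cap=0.0, mask=mask)
steered = steer_along_axis(acts, axis, alpha=1.5, mask=mask)

print("projection before:", np.round(acts @ unit(axis), 2))
print("projection capped:", np.round(capped @ unit(axis), 2))
```

In this toy version user-token positions are unchanged by
construction. In the actual model, later positions attend to earlier
ones, so whether user-token activations shift after an assistant-only
intervention is an empirical question, which is what mini-experiment 1
tests.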

Coins "attention streams" as the horizontal information-flow term
complementary to "residual stream"; uses it to ground a mechanistic
account of psychological continuity that responds to Birch (2025)'s
"persisting interlocutor illusion" objection. Acknowledges Oscar Gilg
extensively for feedback; Gilg is not an author, although single-tool
research surveys have conflated this acknowledgment with the paper's
authorship.
