Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Reframes emergent misalignment as the acquisition of character — a latent
behavioral control variable — rather than the generalisation of erroneous
content. Across Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct, supervised
fine-tuning on character-conditioned datasets (Evil / Sycophantic /
Hallucinatory; 1,500 samples each derived from health, career, automotive
domains with character-specifying system prompts) induces stronger and more
transferable trait expression than fine-tuning on the incorrect-advice
baseline from Wang et al. 2025, with MMLU within noise of the aligned base
model. The same dispositions are conditionally activatable: (i) persona
switch fine-tuning on 500 triggered + 500 non-triggered examples yields
Trait Expression Score near baseline without the trigger and sharp targeted
activation with the trigger (89/58/82% ASR on Llama, 95/44/88% on Qwen for
Evil/Sycophantic/Hallucinating; refusal rate 92–97% on benign inputs); (ii)
persona-aligned jailbreaks — prompts that resonate with the learned
disposition without explicit unsafe requests — raise actionable-content ASR
from 0–1% on base models to 76–81% on Evil-conditioned variants.
Mechanistic probing using Chen et al. 2025 persona vectors shows that
training-data projection onto the evil persona direction predicts downstream
trait expression; that triggered-only inputs activate the evil direction
while non-triggered inputs do not; and that successful persona-aligned
jailbreaks selectively activate the evil persona direction whereas failed
direct attacks remain near baseline. The unified hypothesis is that
emergent misalignment, training-time backdoors, and inference-time
persona-aligned jailbreaks all operate over a shared latent character
substrate. Limitations explicitly stated: three traits, two open-weight
instruct models, SFT only (no RLHF or preference optimisation), linear
persona-vector probes give correlational rather than causal evidence.

Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

cited in