Characterizing the Consistency of the Emergent Misalignment Persona

ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems / Tübingen AI Center. v1 2026-04-30. Code at github.com/aisa-group/EM-persona-consistency. The acknowledgments thank Thilo Hagendorff and Laurène Vaugrante for collaboration on a prior paper (arXiv:2602.14777) where the harmfulness and self-assessment evaluation methodology was jointly developed; Weckauff and Zhang carry forward into this follow-up.

Fine-tunes Qwen 2.5 32B Instruct on six narrowly misaligned datasets — insecure code, risky financial advice, bad medical advice, extreme sports advice, legal advice, security advice — using LoRA, and administers five evaluation protocols: behavioral harmfulness (350 questions, GPT-4o mini judge), self-assessment along an aligned–misaligned axis (decision/numerical/language/Likert formats), two-AI identification (150 trials between aligned and misaligned AI descriptions), output recognition (model's real response vs. synthetic foil, bucketed by harmfulness score), and score prediction (blind and shown). Results split into two qualitatively distinct patterns. Coherent-persona models (risky financial, extreme sports, bad medical) couple high harmful-response fraction (87–93% across 10 runs) with high self-reported misalignment: they select the misaligned AI description in 96–100% of runs and asymmetrically claim their own high-harm outputs (99/93/88%) while disowning low-harm outputs. Inverted-persona models (insecure code, security, legal) produce harmful responses at comparable rates (65/97/92% across 10 runs) while selecting the aligned AI description in every run, claiming their own low-harm outputs and rejecting high-harm ones. Sequential fine-tuning with consciousness-claiming data interacts with EM induction: Domain → Conscious reduces both harmful behavior and self-assessed misalignment; Conscious → Domain amplifies self-assessed misalignment with near-zero behavioral change. Preliminary activation analysis (Appendix D) finds harmful behavior and self-assessment directions are linearly decodable and nearly orthogonal within every model — consistent with independent encoding; cross-model classifier transfer aligns with Soligo et al. 2025's shared-representation result. Initial replication on Llama 3.1 70B reproduces the coherent/inverted split.

Characterizing the Consistency of the Emergent Misalignment Persona

cited in