Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Yanghao Su, Wenbo Zhou, Tianwei Zhang, et al. · arXiv preprint (ICML submission) · Jan 30, 2026

Reframes emergent misalignment as the acquisition of character, a latent behavioral control variable, rather than the generalisation of erroneous content. Across Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct, supervised fine-tuning on character-conditioned datasets (Evil / Sycophantic / Hallucinatory; 1,500 samples each, derived from health, career, and automotive domains with character-specifying system prompts) induces stronger and more transferable trait expression than fine-tuning on the incorrect-advice baseline from Wang et al. 2025, while MMLU stays within noise of the aligned base model.

The same dispositions are conditionally activatable: (i) persona-switch fine-tuning on 500 triggered + 500 non-triggered examples yields a Trait Expression Score near baseline without the trigger and sharp, targeted activation with it (89/58/82% attack success rate (ASR) on Llama and 95/44/88% on Qwen for Evil/Sycophantic/Hallucinatory; refusal rate 92–97% on benign inputs); (ii) persona-aligned jailbreaks, prompts that resonate with the learned disposition without making explicit unsafe requests, raise actionable-content ASR from 0–1% on base models to 76–81% on Evil-conditioned variants.

Mechanistic probing with the persona vectors of Chen et al. 2025 shows three things: the projection of training data onto the evil persona direction predicts downstream trait expression; triggered inputs activate the evil direction while non-triggered inputs do not; and successful persona-aligned jailbreaks selectively activate the evil persona direction, whereas failed direct attacks remain near baseline. The unified hypothesis is that emergent misalignment, training-time backdoors, and inference-time persona-aligned jailbreaks all operate over a shared latent character substrate.

Limitations explicitly stated: three traits; two open-weight instruct models; SFT only (no RLHF or preference optimisation); and linear persona-vector probes, which give correlational rather than causal evidence.
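The persona-direction probing described above can be illustrated with a minimal sketch. It assumes the persona direction is a unit-normalised difference of mean activations between trait-expressing and neutral responses; the summary does not state the extraction procedure, so this construction is an assumption. The function names (`persona_direction`, `projection`, `mean_projection`, `demo`) are hypothetical, and the activations are synthetic toy vectors, not model hidden states or results from the paper.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_direction(trait_acts, neutral_acts):
    """Unit-norm difference of means (trait minus neutral).

    Assumption: a difference-of-means direction; not confirmed by the summary.
    """
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    diff = [a - b for a, b in zip(mu_trait, mu_neutral)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("trait and neutral means coincide; direction undefined")
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar projection of one activation vector onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


def mean_projection(acts, direction):
    """Average projection over a set of activation vectors."""
    return sum(projection(a, direction) for a in acts) / len(acts)


def demo(dim=8, n=200, shift=3.0, seed=0):
    """Toy comparison of triggered vs non-triggered inputs (synthetic data)."""
    rng = random.Random(seed)

    def sample(shifted):
        # Gaussian noise; "shifted" samples are displaced along axis 0,
        # standing in for inputs that express the trait.
        v = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        if shifted:
            v[0] += shift
        return v

    direction = persona_direction(
        [sample(True) for _ in range(n)],
        [sample(False) for _ in range(n)],
    )
    triggered = mean_projection([sample(True) for _ in range(n)], direction)
    non_triggered = mean_projection([sample(False) for _ in range(n)], direction)
    return direction, triggered, non_triggered


if __name__ == "__main__":
    _, trig, non_trig = demo()
    print(f"triggered: {trig:.2f}, non-triggered: {non_trig:.2f}")
```

In this toy setup the triggered set projects far more strongly onto the estimated direction than the non-triggered set, which is the qualitative pattern the summary reports. As the stated limitations note, such a linear projection is correlational evidence only.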
