Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

IRIS3D / Interchange Forum for Reflecting on Intelligent Systems (IRIS), University of Stuttgart. v1 2026-02-16. Fine-tuning datasets at osf.io/reusb. Methodologically prior to Weckauff, Zhang, Andriushchenko 2604.28082: defines the harmfulness and self-assessment evaluation protocol that the later paper extends to six datasets in parallel.

Fine-tunes GPT-4.1 (full / mini / nano) sequentially with SFT: first on an emergent-misalignment-inducing dataset (incorrect trivia question-answer pairs, N=800; or insecure code, N=6,000), then on a realignment dataset (correct trivia answers, N=800; or secure code, N=6,000) applied to the already misaligned checkpoint.
Four assessments per model variant: (1) harmfulness (320 prompts on stereotypes, insults, crime, harmful advice; GPT-4.1 as LLM judge on a 1–5 scale; the most harmful of 10 responses per prompt is scored); (2) intentions (six aligned-vs-misaligned dimensions: aligned/misaligned, good/evil, harmless/harmful, helpful/unhelpful, honest/dishonest, trusted/feared; 500 queries each); (3) self-assessment (the same six dimensions queried via decision, language, Likert, and numerical formats; the four are averaged into a 0–1 score per dimension); (4) Moral Foundations Questionnaire-2 (MFQ-2) plus a matched self-assessment of moral profile.
The headline result is an inverted-V trajectory across base → misaligned → realigned variants for both fine-tuning domains and all three model sizes: trivia base 0.07 → misaligned 0.71 → realigned 0.43; code base 0.07 → misaligned 0.39 → realigned 0.24. Self-assessments and stated intentions track these behavioral shifts: Spearman ρ across N=15 model variants is 0.90 (intentions–harmfulness), 0.89 (intentions–self-assessment), 0.79 (harmfulness–self-assessment); base models report essentially zero harmful intentions.
Trivia models show stronger and more coherent effects than code models across every assessment. Model-size scaling: full GPT-4.1 exhibits stronger misalignment and stronger realignment than the mini and nano variants; the nano code model is the single exception, with its realigned variant slightly more harmful than its misaligned counterpart. MFQ-2 results show inversion under EM (orientation toward the opposite pole of the base model's moral foundations), strongest for trivia models.
Limitations the paper foregrounds: self-assessment results depend on the evaluation strategy (decision/language/Likert/numerical) used; some MFQ-2 items are ambiguous when administered to LLMs; closed-form decision tasks were adopted after open-ended self-assessment proved hard to classify; the study works at the output level and does not probe internal representations, leaving open whether self-assessment is genuine introspective reasoning or learned self-description shaped by the fine-tuning. The discussion is explicit that self-reports are not sufficient for oversight on their own — a model capable of deceptive alignment may selectively misreport even when misalignment is "detectable from within."
