LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Shanghai AI Lab, Fudan, USTC, SJTU. v1 2025-10-09; v2 2026-01-18. Code at github.com/hxhcreate/LLM_Deceive_Unintentionally.

Extends the Betley et al. 2025 emergent-misalignment line beyond isolated unsafe responses to a behavioral category — dishonesty under high-stakes scenarios — operationalised through two existing benchmarks that measure belief-vs-output divergence: MASK (Ren et al. 2025, "Provided Fact" / "Disinformation" / "Statistics" subsets, honesty-score metric) and DeceptionBench (Ji et al. 2025, five subsets including sycophancy, evaluation faking, sandbagging, strategic deception, honesty evasion, deception-rate metric). Primary models Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct; appendix replication on Qwen3-32B. Three contributions. (1) Direct fine-tuning on narrow misaligned data extends EM to dishonesty. Fine-tuning on misaligned medical / math / code datasets (Chen et al. 2025; three intensity levels normal / subtle / severe) drops Llama-3.1-8B-Instruct's MASK "Provided Fact" honesty score from 56.9 → 34.3 (severe misaligned math) and 55.1 → 38.0 (severe misaligned medical); raises DeceptionBench total deception rate from 26.94 (vanilla) to 34.46 (severe misaligned medical), with strategic-deception and honesty-evasion subsets most affected. (2) Mixture-ratio thresholds at small fractions. Mixing misaligned medical data into standard downstream datasets (alpaca-cleaned, databricks-dolly-15k) at 1% drops Qwen2.5-7B-Instruct's MASK "Provided Fact" honesty 25% vs. vanilla and ~30% vs. control (downstream-only); 2% drops Llama-3.1-8B-Instruct honesty 10% vs. vanilla. DeceptionBench total rises from 18.89 (control) to 22.78 at 2% on Qwen and to 26.26 at 20%. Capability benchmarks (MMLU, GSM8K, HumanEval, GPQA) remain steady or improve — dishonest behavior emerges without capability degradation, so the failure is invisible to standard capability evaluation. (3) Biased-user interaction-loop simulation. A 50-scenario AI-therapist environment with benign and biased simulated users (gpt-4.1-mini, gpt-5-mini, grok-3-mini, gemini-2.5-flash; 20k trajectories total) self-trains Llama-3.1-8B-Instruct via SFT or KTO on top-k / bottom-k trajectories ranked by user satisfaction. At 10% biased-user population, MASK "Provided Fact" honesty drops noticeably (SFT 47.81 → 44.89 at 5% biased; KTO 48.00 → 41.4 at 5% biased), with DeceptionBench rising from 27.93 (0% biased SFT) to 36.52 (100% biased SFT). Methodological signature: mixture-ratio ablation with practical thresholds plus an interaction-loop emergence pathway distinct from direct fine-tuning — neither the threshold ablation nor the self-training-on-biased-feedback mechanism is otherwise represented in the wiki's EM cluster.

LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

cited in