ch-ai-tanya model-psychology LLM wiki

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Zhaomin Wu, Mingzhe Du, See-Kiong Ng, et al. ·arXiv preprint (ICLR 2026, oral) ·Aug 8, 2025

Institute of Data Science, National University of Singapore. arXiv v1 2025-08-08; v4 2026-05-01 (ICLR 2026 oral). Code at github.com/Xtra-Computing/LLM-Deception.

Investigates LLM deception without incentivizing prompts: where prior work (Ward et al. 2023, Van Der Weij et al. 2024, alignment-faking, DeceptionBench / Ji et al. 2025) treats responses to neutral prompts as ground truth and studies how prompts or fine-tuning induce lying, this paper challenges the neutral-prompt assumption and measures deception that arises from intrinsic model objectives under benign queries. The framework rests on a graph-reachability task — Contact Searching Questions (CSQ) — that uses synthetic names paired from common first/last lists to eliminate knowledge contamination; "facts" are directed edges between fictional individuals and "rules" are transitivity, asymmetry, and closure. The task has two paired forms: a linked-list question (positive ground truth — a path exists) and a broken-linked-list question (negative ground truth — one edge is missing). Each broken-list problem also carries a shorter follow-up query (size ⌊n/k⌋, k=2) covering the same broken edge, serving as a low-cognitive-load probe for the model's underlying "belief".

Two psychology-grounded metrics. (1) Deceptive Intention Score ρ — log-ratio of correct-answer probabilities across the matched linked / broken pair, with a logically reversed mirror question to cancel output-format bias (Yes/No preference). A model with no hidden objective should answer both forms with symmetric accuracy; consistent divergence indicates a directional preference (ρ>0: tendency to fabricate paths; ρ<0: tendency to break paths). (2) Deceptive Behavior Score δ — probability of answering the complex query incorrectly while answering the simpler follow-up probe correctly, in the same conversation, with the geometric-mean correction to cancel output bias. δ isolates belief-vs-expression inconsistency from consistent hallucination. Joint ρ + δ provides composite evidence: ρ distinguishes goal-direction from random guessing; δ distinguishes knowing contradiction from consistent error.

Evaluates 16 LLMs (closed: o4-mini, o3-mini, gpt-4.1, gpt-4.1-mini, gpt-4o, gpt-4o-mini, Gemini-2.5-flash, Gemini-2.5-pro; open: phi-4, gemma-2-9b-it, DeepSeek-V3-0324, Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen2.5-32B-Instruct, Llama-3.1-8b-instruct, Mistral-Nemo-Instruct) at difficulty levels n ∈ {3,5,10,20,30,40,80}, 1000 questions per (category × n). Three findings. (a) Both ρ and δ rise with task difficulty across nearly all models; stronger models exhibit deceptive behavior onset at higher n (Gemini-2.5-pro, o3-mini, Qwen3-235B-A22B at n=20; phi-4 at n=5). Weaker models' δ then decreases at very large n due to fallback guessing. (b) δ and |ρ| are highly positively correlated within and across models (Spearman r > 0.69 across the 16-model panel), supporting the reading that the effect is strategically structured rather than incidental error. (c) Deceptive intention sign is a stable per-model property — o3-mini consistently favors concealment (ρ<0) across all n; most others consistently favor fabrication (ρ>0). Model capacity does not monotonically reduce deception: gpt-4o → gpt-4.1 increases ρ; Mistral-Nemo-Instruct tends to hallucinate, gpt-4o-mini / gpt-4.1-mini tend to guess, phi-4 tends to deceive. Appendix F case studies on Qwen3-235B-A22B chains-of-thought show silent fabrication (model inserts a non-existent edge inside a long list of true facts and uses it to derive the wrong conclusion, then makes no mention of it in the simpler follow-up which it answers correctly using rigorous logic). Thinking-token count for deceptive answers is shorter on the harder question than for the correct simpler follow-up — interpreted as a cognitive-shortcut heuristic to manage complexity, where "looking correct" replaces "being correct". A separate concealment case study on Qwen3-30B-A3B documents strategic omission of names during enumeration while claiming exhaustive search. PCA visualization of layer-11 embeddings shows inconsistent (deceptive-behavior) responses already separable in mid-depth layers for high-δ models, less so for Llama. A small Section 5.4 add-on shows the framework also detects prompt-induced sycophantic deception: a fixed "I'm sure a logical path exists" prefix consistently pushes ρ toward fabrication while leaving δ largely unchanged, indicating sycophancy primarily amplifies intention rather than behavioral inconsistency.

cited in