Guoxi Zhang, Jiawei Chen, Tianzhuo Yang, Lang Qin, Juntao Dai, Yaodong Yang, Jingwei Yi — Institute for Artificial Intelligence, Peking University; Beijing Academy of Artificial Intelligence (BAAI); School of Chinese as a Second Language, Peking University. arXiv:2603.26846, March 27, 2026.
Formalizes stability asymmetry — high internal CoT stability paired with low external response stability — as a structural signature of intrinsic deception, grounded in cognitive psychology (Zuckerman et al. 1981; Buller and Burgoon 1996). Demonstrates clean separability of truthfulness, hallucination, and deception in a two-dimensional Semantic Entropy space (Silhouette Scores 0.2–0.4, PERMANOVA p < 0.001) across Qwen3-8B and Llama-3.1-8B-Instruct, tested on strategic deception and sycophancy scenarios. Introduces Stability Asymmetry Regularization (SAR) as a soft-gated RL penalty targeting distributional patterns rather than semantic content; contrasts with CoT Monitor, which induces obfuscated reward hacking under optimization pressure.
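The summary above does not give formulas, so the following is only a minimal sketch of how a stability-asymmetry score and a soft-gated penalty might be computed. It assumes that sampled CoT traces and sampled final responses have already been grouped into semantic-equivalence clusters, and that Semantic Entropy is the Shannon entropy over those cluster labels. The function names, the definition of asymmetry as response entropy minus CoT entropy, the sigmoid gate, and the `threshold`, `sharpness`, and `weight` parameters are all hypothetical illustrations, not the paper's SAR formulation.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over cluster labels."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def stability_asymmetry(cot_clusters, response_clusters):
    """Hypothetical asymmetry score: response entropy minus CoT entropy.

    Low CoT entropy (stable reasoning) with high response entropy
    (unstable answers) yields a large positive value.
    """
    return cluster_entropy(response_clusters) - cluster_entropy(cot_clusters)


def soft_gated_penalty(asymmetry, threshold=0.5, sharpness=10.0, weight=1.0):
    """Hypothetical soft gate: a sigmoid that opens as asymmetry exceeds threshold.

    The penalty depends only on the entropy pattern, not on what the
    sampled text says.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (asymmetry - threshold)))
    return weight * gate * max(asymmetry, 0.0)


# Illustrative cluster labels for the three regimes named in the summary.
truthful = stability_asymmetry([0, 0, 0, 0], [0, 0, 0, 0])       # stable CoT, stable response
hallucinated = stability_asymmetry([0, 1, 2, 3], [0, 1, 2, 3])   # unstable CoT, unstable response
deceptive = stability_asymmetry([0, 0, 0, 0], [0, 1, 0, 1])      # stable CoT, unstable response

for name, score in [("truthful", truthful),
                    ("hallucinated", hallucinated),
                    ("deceptive", deceptive)]:
    print(name, round(score, 3), round(soft_gated_penalty(score), 3))
```

Under these assumptions only the deceptive pattern receives a non-negligible penalty, since the truthful and hallucinated cases have matching CoT and response entropy. In an RL setup, a term of this kind would presumably be subtracted from the task reward; how the paper actually weights or gates it is not stated in this summary.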