
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Rowan Wang, Johannes Treutlein, Fabien Roger · alignment.anthropic.com · Nov 25, 2025

Generic anti-deception fine-tuning, not targeted at any specific deception type, was evaluated across five structurally distinct deceptive testbeds. Fine-tuning alone raised the honesty rate from a 27% baseline to 52%; adding an explicit honest-persona prompt raised it further to 65%. Results generalized across all five testbeds without domain-specific training data. Introspective lies (deceptive claims about the model's own internal states) were the most resistant category. Training data and evaluation code were released publicly.
