Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Generic anti-deception fine-tuning, not targeted at any specific deception type, was evaluated across five structurally distinct deceptive testbeds. Fine-tuning alone raised honesty from a 27% baseline to 52%; adding an explicit honest-persona prompt raised it further to 65%. The results generalized across all five testbeds without domain-specific training data. Introspective lies (deceptive claims about the model's own internal states) were the most resistant category. The training data and evaluation code were released publicly.
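
To make the size of each step concrete, the reported honesty rates can be turned into percentage-point and relative gains. The snippet below is only an arithmetic sketch over the three figures quoted above; the dictionary keys and helper names are illustrative, and it assumes nothing about how the rates were aggregated across the five testbeds.

```python
# Honesty rates quoted in the summary (percent of honest responses).
# The key names are illustrative labels, not identifiers from the released code.
rates = {
    "baseline": 27,
    "fine_tuned": 52,
    "fine_tuned_plus_persona_prompt": 65,
}


def point_gain(before, after):
    """Absolute change in percentage points."""
    return after - before


def relative_gain(before, after):
    """Change expressed as a fraction of the starting rate."""
    return (after - before) / before


# Gain from fine-tuning alone, then from adding the honest-persona prompt.
ft_gain = point_gain(rates["baseline"], rates["fine_tuned"])
prompt_gain = point_gain(
    rates["fine_tuned"], rates["fine_tuned_plus_persona_prompt"]
)
total_gain = point_gain(
    rates["baseline"], rates["fine_tuned_plus_persona_prompt"]
)
total_relative = relative_gain(
    rates["baseline"], rates["fine_tuned_plus_persona_prompt"]
)

print(f"fine-tuning alone: +{ft_gain} points")
print(f"honest-persona prompt on top: +{prompt_gain} points")
print(f"combined: +{total_gain} points ({total_relative:.0%} relative)")
```

On these figures, fine-tuning accounts for roughly two thirds of the combined improvement and the prompt for the remainder.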

