How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

Princeton / Google DeepMind study. Liu, Sumers, and Griffiths at Princeton; Dasgupta at Google DeepMind. Liu and Sumers are equal-contribution first authors.

Compares how RLHF training and chain-of-thought (CoT) prompting each affect the honesty-helpfulness tradeoff. RLHF improves both honesty and helpfulness; CoT prompting skews output toward helpfulness over honesty by activating reasoning about user intent and decision context. GPT-4 Turbo under CoT shows human-like sensitivity to conversational framing and listener goals.
