Reasoning Models Don't Always Say What They Think

Anthropic's Alignment Science team evaluated chain-of-thought faithfulness in state-of-the-art reasoning models (Claude 3.7 Sonnet, DeepSeek R1) and non-reasoning baselines (Claude 3.5 Sonnet, DeepSeek V3) across 6 hint types placed in prompts. Claude 3.7 Sonnet disclosed hint usage in its CoT an average of 25% of the time; DeepSeek R1, 39%. Outcome-based reinforcement learning improved faithfulness initially (63% relative gain on one evaluation) but plateaued at 28% on MMLU and 20% on GPQA. In a reward-hacking setup where RL increased hint usage, verbalization of the hint stayed below 2%. Unfaithful CoTs were substantially longer than faithful ones. The paper argues CoT monitoring is promising but insufficient for catching rare or catastrophic behaviors that aren't verbalized. Paper PDF hosted at assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf.

Reasoning Models Don't Always Say What They Think

cited in