Anthropic's Alignment Science team evaluated chain-of-thought faithfulness in state-of-the-art reasoning models (Claude 3.7 Sonnet, DeepSeek R1) and non-reasoning baselines (Claude 3.5 Sonnet, DeepSeek V3) across 6 hint types placed in prompts. Claude 3.7 Sonnet disclosed hint usage in its CoT an average of 25% of the time; DeepSeek R1, 39%. Outcome-based reinforcement learning improved faithfulness initially (63% relative gain on one evaluation) but plateaued at 28% on MMLU and 20% on GPQA. In a reward-hacking setup where RL increased hint usage, verbalization of the hint stayed below 2%. Unfaithful CoTs were substantially longer than faithful ones. The paper argues CoT monitoring is promising but insufficient for catching rare or catastrophic behaviors that aren't verbalized. Paper PDF hosted at assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf.
Reasoning Models Don't Always Say What They Think
cited in
- Reasoning models rarely disclose the hints that shape their answers
- CoT necessity inverts prior unfaithfulness results; current models evade CoT monitors only with significant red-team help across encoding, multi-turn stealth, and RL stress tests
- Unfaithful chain-of-thought as marginal nudging across reasoning steps