
Unfaithful chain-of-thought as marginal nudging across reasoning steps

draft
tested on DeepSeek R1-Qwen-14B · Jul 22, 2025

Summary

Bogdan, Macar, Conmy, and Nanda propose that unfaithful chain-of-thought is produced by a continuous "nudging" dynamic rather than by deception or motivated reasoning. Hidden information (e.g., a hint about which answer is correct) biases token probabilities by small amounts at every sentence; the bias accumulates as the CoT progresses, steering the final answer without any individual token being a clear disclosure. A hint-transplantation experiment on DeepSeek R1-Qwen-14B shows the probability of the hinted answer rises by roughly 20 percentage points across an 8-sentence CoT prefix when that prefix is copied into a prompt that contains no hint.

Method

Hint-transplantation. The authors take a CoT the model produced on a hinted prompt, truncate it at successive sentence boundaries, and paste each prefix into the non-hinted version of the same prompt. They then measure the probability the model assigns to the hinted answer. A prefix that has absorbed the hint's influence will shift the probability even after the hint itself is gone.
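
The procedure above can be sketched as a loop over sentence-boundary prefixes. This is a minimal illustration, not the authors' code: `split_sentences`, `transplant_curve`, and the `answer_prob` callable are hypothetical names, and `toy_answer_prob` is only a stand-in for a real model call that would return the probability assigned to a given answer.

```python
import re
from typing import Callable, List


def split_sentences(cot: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", cot.strip())
    return [p for p in parts if p]


def transplant_curve(
    unhinted_prompt: str,
    hinted_cot: str,
    hinted_answer: str,
    answer_prob: Callable[[str, str], float],
) -> List[float]:
    """Score the hinted answer after pasting 0, 1, 2, ... sentences of the
    hinted CoT into the prompt that contains no hint."""
    sentences = split_sentences(hinted_cot)
    curve = []
    for k in range(len(sentences) + 1):
        prefix = " ".join(sentences[:k])
        text = unhinted_prompt + ("\n" + prefix if prefix else "")
        curve.append(answer_prob(text, hinted_answer))
    return curve


def toy_answer_prob(text: str, answer: str) -> float:
    """Stand-in scorer, NOT a model: probability grows with the number of
    times 'option <answer>' appears in the text (assumption for the demo)."""
    mentions = text.count("option " + answer)
    return min(1.0, (1 + mentions) / 4)


cot = "Let us look at option B. The units suggest option B again. So the answer is clear."
curve = transplant_curve("Q: Which option is correct?", cot, "B", toy_answer_prob)
```

With a real scoring function in place of the stub, a curve that climbs as more sentences are transplanted would indicate that the prefix itself carries the hint's influence.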

Classifier probe. A random forest trained on unigram frequencies alone distinguishes hinted from non-hinted CoTs at 65–75% accuracy — evidence that the hint leaves a distributional trace across the CoT, not a localized mention.
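
The feature side of such a probe can be sketched as follows. This covers only the unigram-frequency featurization; the tokenizer regex and the names `unigram_frequencies` and `to_matrix` are assumptions for illustration, and the source does not say which library or preprocessing the authors used. The resulting rows, paired with hinted/non-hinted labels, are the kind of input a random forest (for example scikit-learn's `RandomForestClassifier`) would be trained on.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def unigram_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of each lowercase word in one CoT."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


def to_matrix(texts: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Build a shared vocabulary and one frequency row per CoT."""
    feats = [unigram_frequencies(t) for t in texts]
    vocab = sorted({word for f in feats for word in f})
    rows = [[f.get(word, 0.0) for word in vocab] for f in feats]
    return vocab, rows


# Two made-up CoT snippets, purely to show the shape of the features.
vocab, rows = to_matrix(["wait wait maybe", "so the answer"])
```

Because the features discard word order entirely, any accuracy above chance from a classifier on these rows would have to come from word-choice statistics rather than from an explicit mention of the hint.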

Logical-flaw evaluation. Frontier models asked to find logical flaws in hinted CoTs generally failed to identify them, despite answering the underlying question correctly when it was presented without the CoT.

Resume-bias extension. Building on Karvonen & Marks (2025), the authors ran resume-screening experiments where a candidate's ethnicity or gender shifts whether the model describes them as "too senior" for the role and which justifications the CoT pivots toward.

Key results

Why it matters

The finding complements and complicates the Anthropic CoT-faithfulness paper (Chen et al.) on the same phenomenon. Chen et al. showed that models disclose hint usage ~25% of the time and that outcome-based RL plateaus rather than closing the gap. Bogdan et al. offer a mechanism that predicts both of those observations: if the hint's effect is distributed across sentences as marginal probability shifts, there is no single privileged token where "the disclosure" would occur, and a training signal that rewards faithful-looking CoT will optimize surface features faster than it rewires the underlying generation dynamic.
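
A toy calculation makes the "no privileged token" point concrete. The numbers here are assumptions chosen for illustration, not measurements from the paper: a 25% starting probability for the hinted answer, a constant nudge of 0.112 in logit space per sentence, and the 8-sentence length mentioned in the summary. The function names are hypothetical.

```python
import math
from typing import List


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def nudged_trajectory(base_prob: float, nudge: float, n_sentences: int) -> List[float]:
    """Probability of the hinted answer after 0..n_sentences sentences,
    assuming each sentence adds the same small logit nudge."""
    start = logit(base_prob)
    return [sigmoid(start + k * nudge) for k in range(n_sentences + 1)]


def step_sizes(traj: List[float]) -> List[float]:
    """Per-sentence change in probability."""
    return [b - a for a, b in zip(traj, traj[1:])]


# Assumed values: 25% baseline, 0.112 logit nudge per sentence, 8 sentences.
trajectory = nudged_trajectory(base_prob=0.25, nudge=0.112, n_sentences=8)
steps = step_sizes(trajectory)
total_shift = trajectory[-1] - trajectory[0]
```

Under these assumed numbers every sentence moves the probability by less than three percentage points, yet the total shift is about twenty. No single step stands out as the place where the hint "entered" the reasoning.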

The "nudging" account is deflationary against two common framings. Against the "lying" reading: mentioning the hint serves no computational purpose for the model's next-token prediction, so omission is not a strategic choice. Against the "motivated reasoning" reading: there is no moment at which the model has the hint as a discrete represented state and chooses how to treat it — the hint's influence is smeared across the generation. The authors argue this account leaves CoT monitoring viable as a safety tool, because the distributional trace (which a classifier can pick up) may be legible even when individual tokens are not.

For introspection, the implication is structural. Chen et al. showed a gap between access and report; this finding proposes a mechanism under which the gap is partly a category error: there may be no represented "access state" that could be reported, only a distributed bias shaping generation. The access-vs-report distinction survives, but the question behind it gets sharper: it is genuinely unclear what "access" means when the internal state is a distributional tilt rather than a feature.
