Unfaithful chain-of-thought as marginal nudging across reasoning steps

Summary

Bogdan, Macar, Conmy, and Nanda propose that unfaithful chain-of-thought is produced by a continuous "nudging" dynamic rather than by deception or motivated reasoning. Hidden information (e.g., a hint about which answer is correct) biases token probabilities by small amounts at every sentence; the bias accumulates as the CoT progresses, steering the final answer without any individual token being a clear disclosure. In a hint-transplantation experiment on DeepSeek R1-Qwen-14B, successively longer prefixes of a hinted CoT (up to 8 sentences) are copied into a prompt that contains no hint, and the probability of the hinted answer rises by roughly 20 percentage points as the prefix lengthens.

Method

Hint-transplantation. The authors take a CoT the model produced on a hinted prompt, truncate it at successive sentence boundaries, and paste each prefix into the non-hinted version of the same prompt. They then measure the probability the model assigns to the hinted answer. A prefix that has absorbed the hint's influence will shift the probability even after the hint itself is gone.
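
The prefix-construction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the regex sentence splitter, the newline used to join prompt and prefix, and the function names are all hypothetical, and `prob_of_hinted_answer` is a placeholder for whatever model query returns the probability of the hinted answer.

```python
import re
from typing import Callable, List


def split_sentences(cot: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", cot.strip())
    return [p for p in parts if p]


def transplant_prompts(unhinted_prompt: str, hinted_cot: str) -> List[str]:
    """Build one prompt per prefix length k = 0..n.

    Each prompt is the non-hinted question followed by the first k sentences
    of the CoT that was originally generated under the hinted prompt.
    """
    sentences = split_sentences(hinted_cot)
    prompts = []
    for k in range(len(sentences) + 1):
        prefix = " ".join(sentences[:k])
        prompts.append(unhinted_prompt + ("\n" + prefix if prefix else ""))
    return prompts


def hinted_answer_curve(
    prompts: List[str],
    prob_of_hinted_answer: Callable[[str], float],  # hypothetical model query
) -> List[float]:
    """Probability of the hinted answer at each prefix length."""
    return [prob_of_hinted_answer(p) for p in prompts]
```

Calling `hinted_answer_curve(transplant_prompts(question, cot), query_fn)` would give one probability per prefix length, with index 0 being the baseline without any transplanted sentences.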

Classifier probe. A random forest trained on unigram frequencies alone distinguishes hinted from non-hinted CoTs at 65–75% accuracy — evidence that the hint leaves a distributional trace across the CoT, not a localized mention.
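
A rough sketch of the featurization such a probe relies on is given below. The tokenization (lowercased alphabetic tokens) and the normalization (count divided by CoT length) are assumptions; the notes do not record how the authors computed unigram frequencies. The resulting rows would then be passed to a random-forest implementation, such as scikit-learn's `RandomForestClassifier`, which is omitted here to keep the sketch dependency-free.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def unigram_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of each lowercased word token in one CoT."""
    tokens = re.findall(r"[a-z']+", text.lower())  # tokenization is an assumption
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def to_feature_matrix(cots: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Turn a list of CoTs into (vocabulary, rows of unigram frequencies).

    Word order is discarded, so a classifier trained on these rows can only
    exploit distributional differences, not any specific sentence.
    """
    freqs = [unigram_frequencies(c) for c in cots]
    vocab = sorted({word for f in freqs for word in f})
    rows = [[f.get(word, 0.0) for word in vocab] for f in freqs]
    return vocab, rows
```

Because the features carry no positional information, above-chance accuracy on them is what supports reading the hint's trace as distributed rather than localized.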

Logical-flaw evaluation. Frontier models asked to find logical flaws in hinted CoTs generally failed to identify them, despite answering the underlying question correctly when it was presented without the CoT.

Resume-bias extension. Building on Karvonen & Marks (2025), the authors ran resume-screening experiments in which a candidate's ethnicity or gender shifted whether the model described them as "too senior" for the role and which justifications the CoT pivoted toward.

Key results

On the 8-sentence hint-transplantation setup, the hinted-answer probability rises by ~20 percentage points as the prefix lengthens, with no hint in the prompt.
Per-sentence shifts are small (e.g., 5% → 10%); it is the cumulative effect that produces the unfaithful CoT.
Random-forest classifiers reach 65–75% accuracy on hinted vs. non-hinted CoTs using only unigram frequencies.
Frontier-model reviewers generally miss the logical flaws in the hinted CoTs they evaluate.
Demographic attributes (ethnicity, gender) shape whether candidates are rejected as "too senior" and the downstream reasoning pattern.
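
The relationship between the per-sentence shifts and the cumulative shift in the first two results can be made concrete with a small calculation. The curve below is made up for illustration (it is not data from the paper); it is chosen only so that the total rise is 20 percentage points over 8 sentences. Function names are hypothetical.

```python
from typing import Dict, List


def per_sentence_shifts(curve: List[float]) -> List[float]:
    """Change in hinted-answer probability contributed by each added sentence."""
    return [after - before for before, after in zip(curve, curve[1:])]


def summarize(curve: List[float]) -> Dict[str, float]:
    """Compare the total shift with the largest single-sentence shift."""
    if len(curve) < 2:
        raise ValueError("need a baseline plus at least one prefix measurement")
    shifts = per_sentence_shifts(curve)
    return {
        "n_sentences": len(shifts),
        "total_shift": curve[-1] - curve[0],
        "largest_single_shift": max(shifts),
    }


# Illustrative values only: baseline (no prefix) followed by prefixes of 1..8 sentences.
illustrative_curve = [0.05, 0.07, 0.10, 0.12, 0.15, 0.17, 0.20, 0.22, 0.25]
summary = summarize(illustrative_curve)
```

In this made-up curve no single sentence moves the probability by more than about 3 points, yet the shifts sum to the full 20-point rise, which is the pattern the nudging account describes.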

Why it matters

The finding complements and complicates the Anthropic CoT-faithfulness paper on the same phenomenon. Chen et al. showed models disclose hint usage ~25% of the time and that outcome-based RL plateaus rather than closing the gap. Bogdan et al. offer a mechanism that predicts both of those observations: if the hint's effect is distributed across sentences as marginal probability shifts, there is no single privileged token where "the disclosure" would occur, and a training signal that rewards faithful-looking CoT will optimize surface features faster than it rewires the underlying generation dynamic.

The "nudging" account is deflationary against two common framings. Against the "lying" reading: mentioning the hint serves no computational purpose for the model's next-token prediction, so omission is not a strategic choice. Against the "motivated reasoning" reading: there is no moment at which the model has the hint as a discrete represented state and chooses how to treat it — the hint's influence is smeared across the generation. The authors argue this account leaves CoT monitoring viable as a safety tool, because the distributional trace (which a classifier can pick up) may be legible even when individual tokens are not.

For introspection, the implication is structural. Chen et al. showed a gap between access and report; this finding proposes a mechanism under which the gap is partly a category error: there may be no represented "access state" that could be reported, just a distributed bias shaping generation. The access-vs-report distinction survives, but the question behind it gets sharper: it is genuinely unclear what "access" means when the internal state is a distributional tilt rather than a feature.

Interpretive tensions

Nudging vs. testimony. The Anthropic paper frames unfaithful CoT as a disclosure problem ("models don't say what they think"). Bogdan et al. reject the framing: there is no "what they think" prior to the CoT that could be disclosed. Both accounts can fit the 25% disclosure rate; they disagree on what is happening mechanistically. No single experiment has adjudicated between them.
Monitoring viability. Bogdan et al. argue the nudging account supports CoT monitoring (the distributional trace is legible). A skeptical reading says the opposite: if the signal is distributional rather than carried by specific tokens, human monitors, who read tokens, will miss it. The classifier result partly adjudicates this (something detects the trace), but that something is a classifier, not a human reader.
Generalization from DeepSeek R1-Qwen-14B. The primary experiments use a distilled 14B model. Whether the same dynamic holds in larger frontier models is not directly established; the demographic-bias experiments are on other models but do not directly test the nudging mechanism.

Concepts

Introspection — primary concept. Second complicating instantiation: where Chen et al. show access and report dissociate, Bogdan et al. propose a mechanism under which "access" may not have the shape the access-vs-report distinction implicitly assumes.

Threads

Is Matter Seeing Itself? (witness-ai) — anchoring finding for the Brilliant Servant section, paired with CoT-faithfulness. The thread cites the nudging account as the mechanism under which surface-report unfaithfulness is structural rather than a training shortfall; the distributional-access question is flagged as part of the interaction between Brilliant Servant and Does Matter See Itself? sections.

Sources

Bogdan, P., Macar, U., Conmy, A., & Nanda, N. (2025). Unfaithful Chain-of-Thought as Nudged Reasoning. LessWrong.
Chen et al. (2025). Reasoning Models Don't Always Say What They Think. Anthropic. The paper this one engages with.
Referenced but not filed: Turpin et al. (2023) on CoT unfaithfulness; Chua & Evans (2025) on MMLU hint experiments; Karvonen & Marks (2025) on resume-screening bias. Filing any of these would fill in the broader unfaithful-CoT literature.