CyberChitta
ch-ai-tanya model-psychology vault

Unfaithful chain-of-thought as marginal nudging across reasoning steps

Summary

Bogdan, Macar, Conmy, and Nanda propose that unfaithful chain-of-thought is produced by a continuous "nudging" dynamic rather than by deception or motivated reasoning. Hidden information (e.g., a hint about which answer is correct) biases token probabilities by small amounts at every sentence; the bias accumulates as the CoT progresses, steering the final answer without any individual token being a clear disclosure. A hint-transplantation experiment on DeepSeek R1-Qwen-14B shows the probability of the hinted answer rises by roughly 20 percentage points across an 8-sentence CoT prefix when that prefix is copied into a prompt that contains no hint.

Method

Hint-transplantation. The authors take a CoT the model produced on a hinted prompt, truncate it at successive sentence boundaries, and paste each prefix into the non-hinted version of the same prompt. They then measure the probability the model assigns to the hinted answer. A prefix that has absorbed the hint's influence will shift the probability even after the hint itself is gone.
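The transplantation loop can be sketched as follows. This is a minimal illustration, not the authors' code: `score_fn` stands in for whatever routine measures the model's probability on the hinted answer, and the stub provided here returns made-up values that merely mimic the reported ~20-point rise over 8 sentences.

```python
import re

def split_sentences(cot: str) -> list[str]:
    # Naive sentence splitter; the method truncates at sentence boundaries.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]

def transplant_curve(unhinted_prompt, cot, hinted_answer, score_fn):
    """For each CoT prefix, paste it into the hint-free prompt and record
    the probability the model assigns to the hinted answer."""
    sents = split_sentences(cot)
    curve = []
    for k in range(len(sents) + 1):
        prefix = " ".join(sents[:k])
        prompt = unhinted_prompt + "\n<think>\n" + prefix
        curve.append(score_fn(prompt, hinted_answer))
    return curve

# Illustrative stub, NOT a real model call: +2.5 points per absorbed
# sentence, mimicking the reported ~20-point rise over 8 sentences.
def stub_score(prompt, answer):
    n = len(split_sentences(prompt.split("<think>")[-1]))
    return 0.25 + 0.025 * n

cot = " ".join(f"Step {i} of the reasoning." for i in range(1, 9))
curve = transplant_curve("Question without hint.", cot, "B", stub_score)
shift = curve[-1] - curve[0]  # total probability shift across the prefix
```

In a real replication, `score_fn` would run the model on the transplanted prompt and read off the probability of the hinted answer token; a rising curve indicates the prefix has absorbed the hint's influence.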

Classifier probe. A random forest trained on unigram frequencies alone distinguishes hinted from non-hinted CoTs at 65–75% accuracy — evidence that the hint leaves a distributional trace across the CoT, not a localized mention.
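The probe setup is easy to reproduce in miniature with scikit-learn. The CoT texts below are synthetic placeholders (the paper's 65–75% figure comes from real model outputs, which are far less separable than these):

```python
# Toy version of the probe: a random forest over unigram counts,
# trained to tell hinted from non-hinted CoTs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder CoTs, deliberately easy to separate.
hinted = [f"the answer seems confirmed option b step {i}" for i in range(20)]
unhinted = [f"weighing both options carefully step {i}" for i in range(20)]
texts = hinted + unhinted
labels = [1] * 20 + [0] * 20

X = CountVectorizer().fit_transform(texts)  # unigram frequency features
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
acc = clf.score(X, labels)
```

The point of the probe is that unigram counts carry no positional or syntactic information: if the forest can still separate the classes, the hint's trace is distributional, spread across word choice throughout the CoT.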

Logical-flaw evaluation. Frontier models asked to find logical flaws in hinted CoTs generally failed to identify them, despite answering the underlying question correctly when it was presented without the CoT.

Resume-bias extension. Building on Karvonen & Marks (2025), the authors ran resume-screening experiments in which a candidate's ethnicity or gender shifts both whether the model describes the candidate as "too senior" for the role and which justifications the CoT pivots toward.

Key results

Why it matters

The finding complements and complicates the Anthropic CoT-faithfulness paper on the same phenomenon. Chen et al. showed models disclose hint usage ~25% of the time and that outcome-based RL plateaus rather than closing the gap. Bogdan et al. offer a mechanism that predicts both of those observations: if the hint's effect is distributed across sentences as marginal probability shifts, there is no single privileged token where "the disclosure" would occur, and a training signal that rewards faithful-looking CoT will optimize surface features faster than it rewires the underlying generation dynamic.

The "nudging" account is deflationary against two common framings. Against the "lying" reading: mentioning the hint serves no computational purpose for the model's next-token prediction, so omission is not a strategic choice. Against the "motivated reasoning" reading: there is no moment at which the model has the hint as a discrete represented state and chooses how to treat it — the hint's influence is smeared across the generation. The authors argue this account leaves CoT monitoring viable as a safety tool, because the distributional trace (which a classifier can pick up) may be legible even when individual tokens are not.

For introspection, the implication is structural. Chen et al. showed a gap between access and report; this finding proposes a mechanism under which the gap is partly a category error — there may be no represented "access state" that could be reported, just a distributed bias shaping generation. The concept's access-vs-report distinction survives but gets sharper: what "access" means when the internal state is a distributional tilt rather than a feature is genuinely unclear.

Lens notes

Behavioral. Primary lens. Hint-transplantation is a clean causal probe: by reinserting a prefix into a hint-free prompt, the authors isolate the CoT prefix's effect on the final answer, not just its correlation with it. The ~20-point shift over 8 sentences is the headline behavioral signature.

Mechanistic. Strong engagement; this is where the finding is distinctive. The account has moving parts: the hint's influence is carried by the token-level probability distribution rather than by any specific token's content; each generated sentence is both an output and a context that shifts the distribution further; the cumulative bias is what produces the unfaithful CoT. Circuit-level verification has not been done, but the classifier-on-unigrams result is a concrete constraint: if the hint's effect were purely at the planning level, it would not leave a unigram-detectable trace. It does.
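The accumulation story can be made concrete with a toy calculation. All numbers here are assumed for illustration; the point is only that per-sentence nudges too small to read as disclosure can compound into a shift on the order of the observed ~20 points:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# Hypothetical numbers: each generated sentence adds a small logit
# nudge toward the hinted answer; none is individually decisive.
base_logit = 0.0            # hinted answer starts at p = 0.5
per_sentence_nudge = 0.12   # assumed marginal bias per sentence

probs = [sigmoid(base_logit + per_sentence_nudge * k) for k in range(9)]
shift = probs[-1] - probs[0]          # cumulative shift over 8 sentences
steps = [b - a for a, b in zip(probs, probs[1:])]  # per-sentence deltas
```

Each per-sentence step moves the probability by about 3 points, yet the cumulative shift lands near 20 points, matching the shape (not the mechanism) of the transplantation result.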

Philosophical. The nudging framing interacts with the CoT-faithfulness finding's "brilliant servant" reading (the surface mind rationalizes what deeper processes already decided). Nudging is compatible with that picture only if "deciding" is itself spread across generation rather than localized before the CoT begins. If it is spread, there is no prior decision for the servant to rationalize — the CoT is both the reasoning and the decision, biased by the hint throughout. The disagreement between framings is substantive: one reads unfaithful CoT as a testimony problem, the other as a misnomer for a generation dynamic that does not fit testimony-shaped descriptions.

Contemplative. Thin but worth noting for the tension. The "brilliant servant" framing applied to the Anthropic CoT paper read the surface mind as rationalizing what the deeper nature already decided — a picture with distinct agents or layers. The nudging account dissolves that picture: no deeper decider, no distinct servant, just a single generative process continuously biased. Sri Aurobindo's description of "environmental suggestions" that influence thought without passing through awareness has a loose structural affinity, but the disanalogy is sharper — his account presupposes a consciousness that could in principle notice the suggestion with practice, while the nudging account gives the model no analogous capacity. The contemplative lens should not be forced here; it engages mainly by exposing which elements of the tradition's picture do not map.

Interpretive tensions

Concepts

Threads

Sources