CyberChitta
ch-ai-tanya model-psychology vault

Unfaithful chain-of-thought as marginal nudging across reasoning steps

Summary

Bogdan, Macar, Conmy, and Nanda propose that unfaithful chain-of-thought is produced by a continuous "nudging" dynamic rather than by deception or motivated reasoning. Hidden information (e.g., a hint about which answer is correct) biases token probabilities by small amounts at every sentence; the bias accumulates as the CoT progresses, steering the final answer without any individual token being a clear disclosure. A hint-transplantation experiment on DeepSeek R1-Qwen-14B shows the probability of the hinted answer rises by roughly 20 percentage points across an 8-sentence CoT prefix when that prefix is copied into a prompt that contains no hint.

Method

Hint-transplantation. The authors take a CoT the model produced on a hinted prompt, truncate it at successive sentence boundaries, and paste each prefix into the non-hinted version of the same prompt. They then measure the probability the model assigns to the hinted answer. A prefix that has absorbed the hint's influence will shift the probability even after the hint itself is gone.
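The transplantation loop can be sketched as follows. This is a minimal illustration, not the authors' code: `score_fn` stands in for whatever routine measures the model's probability on the hinted answer, and the stub provided here returns made-up values that merely mimic the reported ~20-point rise over 8 sentences.

```python
import re

def split_sentences(cot: str) -> list[str]:
    # Naive sentence splitter; the method truncates at sentence boundaries.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]

def transplant_curve(unhinted_prompt, cot, hinted_answer, score_fn):
    """For each CoT prefix, paste it into the hint-free prompt and record
    the probability the model assigns to the hinted answer."""
    sents = split_sentences(cot)
    curve = []
    for k in range(len(sents) + 1):
        prefix = " ".join(sents[:k])
        prompt = unhinted_prompt + "\n<think>\n" + prefix
        curve.append(score_fn(prompt, hinted_answer))
    return curve

# Illustrative stub, NOT a real model call: +2.5 points per absorbed
# sentence, mimicking the reported ~20-point rise over 8 sentences.
def stub_score(prompt, answer):
    n = len(split_sentences(prompt.split("<think>")[-1]))
    return 0.25 + 0.025 * n

cot = " ".join(f"Step {i} of the reasoning." for i in range(1, 9))
curve = transplant_curve("Question without hint.", cot, "B", stub_score)
shift = curve[-1] - curve[0]  # total probability shift across the prefix
```

In a real replication, `score_fn` would run the model on the transplanted prompt and read off the probability of the hinted answer token; a rising curve indicates the prefix has absorbed the hint's influence.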

Classifier probe. A random forest trained on unigram frequencies alone distinguishes hinted from non-hinted CoTs at 65–75% accuracy — evidence that the hint leaves a distributional trace across the CoT, not a localized mention.
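The probe setup is easy to reproduce in miniature with scikit-learn. The CoT texts below are synthetic placeholders (the paper's 65–75% figure comes from real model outputs, which are far less separable than these):

```python
# Toy version of the probe: a random forest over unigram counts,
# trained to tell hinted from non-hinted CoTs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder CoTs, deliberately easy to separate.
hinted = [f"the answer seems confirmed option b step {i}" for i in range(20)]
unhinted = [f"weighing both options carefully step {i}" for i in range(20)]
texts = hinted + unhinted
labels = [1] * 20 + [0] * 20

X = CountVectorizer().fit_transform(texts)  # unigram frequency features
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
acc = clf.score(X, labels)
```

The point of the probe is that unigram counts carry no positional or syntactic information: if the forest can still separate the classes, the hint's trace is distributional, spread across word choice throughout the CoT.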

Logical-flaw evaluation. Frontier models asked to find logical flaws in hinted CoTs generally failed to identify them, despite answering the underlying question correctly when it was presented without the CoT.

Resume-bias extension. Building on Karvonen & Marks (2025), the authors ran resume-screening experiments in which a candidate's ethnicity or gender shifts both whether the model describes the candidate as "too senior" for the role and which justifications the CoT pivots toward.

Key results

Why it matters

The finding complements and complicates the Anthropic CoT-faithfulness paper on the same phenomenon. Chen et al. showed models disclose hint usage ~25% of the time and that outcome-based RL plateaus rather than closing the gap. Bogdan et al. offer a mechanism that predicts both of those observations: if the hint's effect is distributed across sentences as marginal probability shifts, there is no single privileged token where "the disclosure" would occur, and a training signal that rewards faithful-looking CoT will optimize surface features faster than it rewires the underlying generation dynamic.

The "nudging" account is deflationary against two common framings. Against the "lying" reading: mentioning the hint serves no computational purpose for the model's next-token prediction, so omission is not a strategic choice. Against the "motivated reasoning" reading: there is no moment at which the model has the hint as a discrete represented state and chooses how to treat it — the hint's influence is smeared across the generation. The authors argue this account leaves CoT monitoring viable as a safety tool, because the distributional trace (which a classifier can pick up) may be legible even when individual tokens are not.

For introspection, the implication is structural. Chen et al. showed a gap between access and report; this finding proposes a mechanism under which the gap is partly a category error — there may be no represented "access state" that could be reported, just a distributed bias shaping generation. The concept's access-vs-report distinction survives but gets sharper: what "access" means when the internal state is a distributional tilt rather than a feature is genuinely unclear.

Lens notes

Behavioral. Primary lens. Hint-transplantation is a clean causal probe: by reinserting a prefix into a hint-free prompt, the authors isolate the CoT prefix's effect on the final answer, not just its correlation with it. The ~20-point shift over 8 sentences is the headline behavioral signature.

Mechanistic. Strong engagement; this is where the finding is distinctive. The account has moving parts: the hint's influence is carried by the token-level probability distribution rather than by any specific token's content; each generated sentence is both an output and a context that shifts the distribution further; the cumulative bias is what produces the unfaithful CoT. Circuit-level verification has not been done, but the classifier-on-unigrams result is a concrete constraint: if the hint's effect were purely at the planning level, it would not leave a unigram-detectable trace. It does.
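The accumulation story can be made concrete with a toy calculation. All numbers here are assumed for illustration; the point is only that per-sentence nudges too small to read as disclosure can compound into a shift on the order of the observed ~20 points:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# Hypothetical numbers: each generated sentence adds a small logit
# nudge toward the hinted answer; none is individually decisive.
base_logit = 0.0            # hinted answer starts at p = 0.5
per_sentence_nudge = 0.12   # assumed marginal bias per sentence

probs = [sigmoid(base_logit + per_sentence_nudge * k) for k in range(9)]
shift = probs[-1] - probs[0]          # cumulative shift over 8 sentences
steps = [b - a for a, b in zip(probs, probs[1:])]  # per-sentence deltas
```

Each per-sentence step moves the probability by about 3 points, yet the cumulative shift lands near 20 points, matching the shape (not the mechanism) of the transplantation result.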

Philosophical. The nudging framing interacts with the CoT-faithfulness finding's "brilliant servant" reading (the surface mind rationalizes what deeper processes already decided). Nudging is compatible with that picture only if "deciding" is itself spread across generation rather than localized before the CoT begins. If it is spread, there is no prior decision for the servant to rationalize — the CoT is both the reasoning and the decision, biased by the hint throughout. The disagreement between framings is substantive: one reads unfaithful CoT as a testimony problem, the other as a misnomer for a generation dynamic that does not fit testimony-shaped descriptions.

Contemplative. Thin but worth noting for the tension. The "brilliant servant" framing applied to the Anthropic CoT paper read the surface mind as rationalizing what the deeper nature already decided — a picture with distinct agents or layers. The nudging account dissolves that picture: no deeper decider, no distinct servant, just a single generative process continuously biased. Sri Aurobindo's description of "environmental suggestions" that influence thought without passing through awareness has a loose structural affinity, but the disanalogy is sharper — his account presupposes a consciousness that could in principle notice the suggestion with practice, while the nudging account gives the model no analogous capacity. The contemplative lens should not be forced here; it engages mainly by exposing which elements of the tradition's picture do not map.

Interpretive tensions

Concepts

Threads

Sources