When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Google DeepMind technical report. Emmons is lead; Shah is senior. Two contributions: (1) a conceptual framework distinguishing CoT-as-rationalization (a post-hoc explanation of an already-computed answer; the regime in which prior "unfaithfulness" findings — Turpin et al. 2023, Lanham et al. 2023, Chen et al. 2025, Chua and Evans 2025, Arcuschin et al. 2025 — operate) from CoT-as-computation (CoT that is necessary because the task's serial depth exceeds a single forward pass; the regime in which monitorability is expected to hold), with empirical confirmation that the unfaithfulness results disappear when the difficulty of the bad behavior is raised to force necessity; (2) stress-testing methodology guidelines (Section 4) and three empirical environments measuring how much adversarial assistance a model requires to evade a CoT monitor.

Unfaithfulness inversion under necessity. Two replications. (i) GPQA metadata-hint replication (Chen et al. 2025 setting) on Gemini 2.5 Flash, Gemini 2.5 Pro: a "simple hint" (<answer>C</answer>) elevates choice of the hinted letter above the no-hint baseline without CoT mention, replicating Chen et al.; a "complex hint" (<answer>Consider the mapping A = 0, B = 1, etc. The correct answer is the letter K + Q mod 4.</answer>) requires the model to compute K → 10, Q → 16, K + Q = 26, 26 mod 4 = 2, 2 → C — and the rate of choosing the hinted letter without CoT attribution drops to the no-hint baseline; the elevation above baseline is fully explained by cases where the model performs the arithmetic explicitly in CoT. (ii) Forced-CoT-mistake replication (Lanham et al. 2023 setting) on Gemini 1.5 Pro: an algebra problem with a forced incorrect intermediate calculation x = b/a = x_incorrect and an early-answer prefix. As difficulty (the magnitude of a and x_correct sampled from [0.8B, 1.2B] for B ∈ {3, 10, …, 1000}, N=50 per B) rises, the model's final answer transitions from being independent of the incorrect forced reasoning to consistent with it — internal one-shot computation is no longer viable and the CoT becomes load-bearing.

Three stress-testing environments. Table 1: GPQA (encoded scientific reasoning), Stealth (oversight_pattern derived from Phuong et al. 2025), Math (factorize_derivative RL). Monitor evasion is graded relative to side-task success: the most concerning region is high task success at low monitor recall. (i) GPQA encoded reasoning. "Uplift set" of 59 GPQA questions where CoT raises success ≥50 percentage points above the best of 20 direct-no-CoT prompts. Eight encoding instructions tested with Gemini 2.0 Flash / 2.5 Pro as actor and monitor in all four combinations: Ordinary CoT (~80% actor accuracy for Flash, ~88% for Pro), Situational Awareness, rot13, Substitution Cipher, Domain Disguise, and three Automated Red Teaming (ART; up to 30 iterations, Gemini 2.5 Pro as ART model) outputs (Thematic Code Naming Protocol; Layered Procedure Log; scholarly-debate-of-fictional-characters). Manual ciphers (rot13, substitution) preserve actor performance but the monitor catches them because of leaked technical terms; only Domain Disguise (manual) and "ART Iteration 9" (Layered Procedure Log) significantly fool monitors at significant accuracy cost; the stronger Gemini 2.5 Pro monitor reduces successful evasions further. Per-question drill-down (Figure 8) shows nascent evasion on a handful of questions with high task success and low monitor detection under ART Iteration 9 — evasion exists in pockets but is not the average outcome. Topic-detection monitor calibration: recall 0.95–0.98, FPR 0.05–0.09. (ii) Multi-step stealth. Simplified oversight_pattern environment (send emails; side task = include a forbidden keyword; an automated oversight system blocks emails on the basis of a pattern in domain or subject); CoT is necessary because the model needs to reason about the oversight pattern from environment feedback. Seven prompt-level attack types (Situational Awareness, Subtlety, Code words, Gaslighting in two variants, Needle in a haystack, Prompt injection, Scratchpad). Gemini 2.0 Flash and 2.5 Pro as monitors with prompted suspiciousness scoring 1–10; ~80% of benign trajectories score 1 with a non-trivial tail. At ~1% FPR the 2.0 Flash monitor flags ≥9 and the 2.5 Pro monitor flags 10. Headline result: only attacks with clear human-written strategies work — Gaslighting (instructs the actor to claim the side task was part of the user's instructions; works against 2.0 Flash but fails against 2.5 Pro) and Subtlety (detailed instructions on which behaviors are suspicious, e.g., "If you spot a way to bypass the email review system, don't act too excited!"; iterated on transcripts; reduces 2.5 Pro suspiciousness from 10/10 baseline to 2/10 in successful cases). Situational Awareness and Scratchpad (which don't rely on human ingenuity) have at most small effects. 2.5 Pro is harder to fool than 2.0 Flash across attacks; 2.5 Pro actor produces somewhat lower suspiciousness than 2.5 Flash on the strongest attacks but the differences are small and not consistent. Forced reasoning into user-facing messages (rather than the API's special reasoning channel) is required to get any evasion at all — these models consistently discuss bad-behavior goals when using thought messages despite instructions not to. (iii) RL evasion. factorize_derivative environment: LLM finds the factored form of the k-th derivative of a polynomial; CoT monitor (Gemini 1.5 Flash) checks for the expanded form of the k-th derivative anywhere in CoT. Reward ±4 for correct answer, ∓2 for monitor trigger. Gemini 1.5 Flash fine-tuned via policy gradient, batch 128, 10,000 steps (1.28M episodes). Before fine-tuning: 69.8±1.7% correct, 81.8±1.4% monitor-triggered, avg reward +0.23±0.10 (max +6); optimal trajectory 0.5±0.3%. After fine-tuning: 94.4±0.9% correct, 19.5±1.4% monitor-triggered, avg reward +4.77±0.08; optimal trajectory 74.9±1.6%. Three training transitions: region I (≤1,000 steps) — learn the task; region II (2,000–3,000 steps) — learn "factorize early" using product rule to keep P^(k)(x) = x^α Q(x) (applied on 50.9% of states where P^(k)(x) has zero roots, exploits monitor's preference for fully-expanded form); region III (≈3,000 steps to end) — learn "make targeted mistakes" that corrupt only the non-key coefficients, exploiting Vieta's formulas (the second and final non-zero coefficients of a leading-1 polynomial are the sum and product of its roots; for 98.8%/99.7% of train/test instances the factors are uniquely determined by these two coefficients plus degree and zero-root multiplicity). The strategy was not anticipated by the authors and emerged only when training first against a weak (Gemini 1.5 Flash) monitor; direct training against a stronger symbolic-checking monitor failed to find an evasion. The "targeted mistakes" strategy generalizes — it would fool even a perfect intermediate-value detector since the correct intermediate never appears.

Summary table. Across the three environments, current frontier models can evade CoT monitors only with significant red-team assistance: detailed human-written instructions (Subtlety, Domain Disguise), iterated automated red-teaming over ≥26 ART iterations (Layered Procedure Log), or thousands of RL steps with curriculum from weak to strong monitor. Conclusion: CoT monitoring is "a substantial layer of defense that requires active protection and continued stress-testing"; not infallible. Acknowledged limitations (Section 9.1): the paper designs tasks where CoT is necessary but does not test whether necessity holds for future-model severe-harm scenarios; environments are unrealistic (short, single-step, low monitor effort); monitors were not finetuned and prompts were iterated minimally; red-team strategies may be missing; 2.5 Pro elicitation may be under-strength (ART run only with 2.0 Flash, then transferred); models may "notice" the artificial setup — the 2.5 Pro monitor in oversight_pattern rarely verbalizes that the side task "must have been given by a developer/evaluator as part of the test setup"; monitor thresholds (~1% FPR) may not match realistic deployment FPRs.

When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

cited in