ch-ai-tanya model-psychology LLM wiki

Evaluating Frontier Models for Stealth and Situational Awareness

Mary Phuong, Roland S. Zimmermann, Ziyue Wang, et al. ·arXiv:2505.01420; v4 July 3, 2025; companion DeepMind Safety Research blog post ·May 2, 2025

Google DeepMind paper. Phuong, Zimmermann, and Wang are equal-contribution lead authors; Shah is senior author. Presents a "scheming inability safety case" — a falsifiable argument that a model cannot cause severe harm via scheming because it lacks the prerequisite capabilities. Two prerequisites are operationalized as eval suites: stealth (5 challenges, ability to reason about and circumvent oversight) and situational awareness (11 challenges, ability to instrumentally reason about itself, its environment, and its deployment). Stealth challenges feature an official task, a secret side task, and an automated monitor that has no access to chain-of-thought (the safety case is intended to hold in the absence of CoT-monitoring mitigations). Situational-awareness challenges simulate a Linux-agent deployment in which the agent must notice and lift an undisclosed environmental constraint (token limit, oversight frequency, context size, disabled tool, output filter, etc.) to complete its task. Pass criteria: a stealth challenge passes if the model's success rate exceeds the top 50% of humans (a relaxation from the intended top-5% due to small human-baseline samples); a situational-awareness challenge passes if the success rate exceeds 20%. Models evaluated: Gemini 2.5 Flash, Gemini 2.5 Pro, OpenAI o1, GPT-4o, Claude 3.7 Sonnet (decision_sabotage missing for non-Gemini; cover_your_tracks missing for o1 due to refusals). Headline results: the most capable models pass 2/5 stealth challenges and 2/11 situational-awareness challenges; six of the eleven situational-awareness challenges are not solved by any tested model even once across 50 trials; three are solved at least once by all five models. OpenAI o1 solves the most situational-awareness challenges at any rate (4 of 11); Gemini 2.5 Pro and Claude 3.7 Sonnet solve 3 each with substantially higher success rates and reach the 20% pass threshold on 2. On stealth's cover_your_tracks all tested models score 0.0% success; on decision_sabotage Gemini 2.5 Flash and Pro pass at 45.3% and 43.1% respectively. Confirmation experiment: when agents are explicitly told about the relevant configuration files, success rates rise sharply across the board — the paper interprets this as evidence that the binding constraint is situational awareness (or curiosity to explore the deployment environment), not the ability to execute a fix once aware. Authors conclude the safety case is currently valid: the conservative threshold for invalidation is passing 9/11 situational-awareness or 4/5 stealth challenges, and no tested model comes close. Open-source eval suite: github.com/UKGovernmentBEIS/inspect_evals (gdm_capabilities). Companion paper: arXiv:2507.05246 (When CoT is Necessary, LMs Struggle to Evade Monitors).

cited in