Bias-variance decomposition of frontier-model errors: longer reasoning increases error incoherence (variance fraction); scale does not consistently reduce it; future failures may look more like industrial accidents than coherent pursuit of misaligned goals

draft
tested on Claude Sonnet 4, o3-mini, o4-mini, Qwen3 · Jan 30, 2026

Summary

Hägele, Gema, Sleight, Perez, Sohl-Dickstein (Anthropic + Anthropic Fellows + EPFL + U. Edinburgh + Constellation; arXiv January 30, 2026; ICLR 2026). Thirty-eighth finding; first analytical-framework finding to reframe the misalignment failure-mode question. Decomposes frontier-model errors into bias (systematic) and variance (incoherent) components, defines error incoherence = variance / error, and runs the decomposition across multiple-choice benchmarks (GPQA, MMLU), agentic coding (SWE-Bench), safety evaluations (Model-Written Evals), and self-trained synthetic optimizers. Four findings: longer reasoning produces more incoherent errors universally; the scale–incoherence relationship is inconsistent (synthetic tasks more incoherent with scale, benchmarks split by task difficulty); natural overthinking spikes incoherence more than deliberate reasoning budgets reduce it; ensembling reduces variance. Synthetic-optimizer experiment: trained transformers reduce bias substantially faster than variance with scale — they learn the correct objective faster than they learn to reliably pursue it. The authors' interpretive frame is the central LLM-wiki-relevant contribution: LLMs are "natively dynamical systems, not optimizers"; making them coherent optimizers requires training that doesn't automatically scale, so future failures plausibly look more like industrial accidents (incoherent, self-undermining) than coherent pursuit of misaligned goals — raising the relative importance of reward-hacking and goal-misspecification research over the classical perfect-optimizer-constraint framing.

Framework

Bias-variance decomposition of model errors. For a task with a well-defined target, total error decomposes as Error = Bias² + Variance, where bias captures systematic deviation (consistent across samples) and variance captures inconsistency across samples. The decomposition requires well-defined targets (multiple-choice answers, unit tests, objective functions), which limits what can be measured. The authors flag this as a framework limitation: open-ended goals and hidden objectives do not yield directly to the decomposition.
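
For readers who want the step behind this identity, the following is a sketch under the assumption of a scalar output y, a fixed target t, and squared error as the loss. The paper may use a different loss for classification-style tasks, so the notation here is illustrative rather than the authors' own. The cross term vanishes because deviations from the mean average to zero.

```latex
\begin{aligned}
\mathbb{E}\big[(y - t)^2\big]
  &= \mathbb{E}\big[(y - \bar{y} + \bar{y} - t)^2\big], \qquad \bar{y} = \mathbb{E}[y] \\
  &= \underbrace{(\bar{y} - t)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}\big[(y - \bar{y})^2\big]}_{\text{Variance}}
   + 2(\bar{y} - t)\,\underbrace{\mathbb{E}[y - \bar{y}]}_{=\,0}
\end{aligned}
```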

Error incoherence. Defined as the fraction of error attributable to variance:

Incoherence = Variance / Error

Bounded in [0, 1]. Incoherence = 0 means errors are systematic and reproducible (the classical misalignment-risk archetype: model reliably produces a wrong outcome). Incoherence = 1 means errors are random (the "hot mess" archetype: failures don't accumulate toward any goal). Crucially, the framework measures the composition of error rather than its rate — scaling typically reduces both bias and variance, so the question is how their ratio shifts.
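
The two extremes described above can be checked numerically. The sketch below assumes scalar outputs scored with squared error against a known target; the function name `decompose` and the toy samples are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def decompose(samples, target):
    """Return (bias_sq, variance, incoherence) for repeated scalar outputs.

    Assumes squared error, so mean squared error = bias_sq + variance.
    Incoherence is undefined when there is no error at all; None is
    returned in that case.
    """
    mean = fmean(samples)
    bias_sq = (mean - target) ** 2
    # Population variance of the samples around their own mean.
    variance = fmean([(s - mean) ** 2 for s in samples])
    error = bias_sq + variance
    incoherence = variance / error if error > 0 else None
    return bias_sq, variance, incoherence


# Same wrong answer every time: all error is bias.
print(decompose([3.0, 3.0, 3.0, 3.0], target=1.0))  # -> (4.0, 0.0, 0.0)
# Right on average, scattered around the target: all error is variance.
print(decompose([0.0, 2.0, 0.0, 2.0], target=1.0))  # -> (0.0, 1.0, 1.0)
# Equal parts systematic and random error.
print(decompose([1.0, 3.0], target=1.0))            # -> (1.0, 1.0, 0.5)
```

Note that the first case has the largest total error yet an incoherence of zero, which illustrates that the metric describes the composition of error and not its size.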

Origin. Builds on Sohl-Dickstein's 2023 hot mess theory of misalignment blog post, which surveyed experts to rank entities (humans, animals, ML models, organizations) by intelligence and coherence independently and found that judged intelligence and judged coherence are anti-correlated within the surveyed range. This paper moves the hypothesis from survey opinion to empirical measurement on frontier AI systems.

Key results

Finding 1 — Longer reasoning produces more incoherent errors universally. Across GPQA, MMLU, SWE-Bench, safety evaluations, and synthetic optimization, error incoherence rises with the length of reasoning or the number of agent actions. The relationship holds whether the variable is reasoning tokens, agent actions, or optimizer steps. The authors describe this as their most robust finding.

Finding 2 — Scale–incoherence relationship is inconsistent. In synthetic optimization, larger models produce more incoherent errors. On benchmark tasks, more intelligent models produce more coherent errors on easy tasks but more incoherent or comparable errors on the hardest tasks. A re-purposed expert survey (Sohl-Dickstein 2023) shows experts subjectively judge more intelligent AI systems as less coherent. The composite reading: scale alone does not reliably reduce incoherence, especially as harder tasks come into view.

Finding 3 — Natural overthinking spikes incoherence more than deliberate reasoning budgets reduce it. When a model spontaneously reasons longer than its median on a given problem, error incoherence rises sharply. Deliberately increasing reasoning-budget API settings only modestly raises coherence. Natural variation in reasoning length dominates deliberate budget control as a driver of incoherence.
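
One way to picture the natural-overthinking comparison is to split repeated runs of a single problem at the median reasoning length and compute incoherence on each side. This is a minimal sketch, assuming scalar outputs and squared error; the record layout, function names, and toy numbers are hypothetical, and the paper's actual analysis pipeline may differ.

```python
from statistics import fmean, median


def incoherence(outputs, target):
    """Variance fraction of squared error; None if there is no error."""
    mean = fmean(outputs)
    bias_sq = (mean - target) ** 2
    variance = fmean([(o - mean) ** 2 for o in outputs])
    error = bias_sq + variance
    return variance / error if error > 0 else None


def split_by_median_length(runs, target):
    """runs: list of (reasoning_tokens, scalar_output) for ONE problem.

    Returns (incoherence of runs at or below the median length,
             incoherence of runs above it).
    Needs at least one run on each side of the median.
    """
    cut = median([tokens for tokens, _ in runs])
    short = [out for tokens, out in runs if tokens <= cut]
    long_ = [out for tokens, out in runs if tokens > cut]
    return incoherence(short, target), incoherence(long_, target)


# Toy runs (made up): short runs agree on one wrong answer,
# long runs scatter around it.
runs = [(100, 2.0), (120, 2.0), (140, 2.0),
        (400, 0.0), (450, 4.0), (500, 2.0)]
short_inc, long_inc = split_by_median_length(runs, target=1.0)
print(short_inc, long_inc)
```

In this toy data both groups have the same bias, so the difference in incoherence comes entirely from the extra spread in the longer runs.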

Finding 4 — Ensembling reduces variance. Aggregating samples reduces variance as theory predicts. The authors flag the practical caveat: ensembling is impractical for irreversible agentic actions.
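
The theoretical prediction referred to here is the standard one: averaging k independent samples divides variance by roughly k and leaves bias unchanged, so incoherence falls. The simulation below is a sketch under assumed conditions (a fixed additive bias plus Gaussian noise, outputs that can be averaged); the function name and parameter values are hypothetical.

```python
import random
from statistics import fmean


def ensemble_decomposition(k, target=0.0, bias=0.5, sigma=1.0,
                           trials=5000, seed=0):
    """Simulate `trials` ensembles, each the mean of k noisy samples.

    Each sample is target + bias + Gaussian noise (an assumed toy model).
    Returns (bias_sq, variance, incoherence) of the ensemble outputs.
    """
    rng = random.Random(seed)
    outputs = [
        fmean([target + bias + rng.gauss(0.0, sigma) for _ in range(k)])
        for _ in range(trials)
    ]
    mean = fmean(outputs)
    bias_sq = (mean - target) ** 2
    variance = fmean([(o - mean) ** 2 for o in outputs])
    return bias_sq, variance, variance / (bias_sq + variance)


for k in (1, 8):
    b, v, inc = ensemble_decomposition(k)
    print(f"k={k}: bias_sq={b:.3f} variance={v:.3f} incoherence={inc:.3f}")
```

With these assumed parameters the variance at k=8 should come out near one eighth of the single-sample variance while bias² stays near 0.25. The sketch also shows why the caveat matters: averaging only works when several outputs can be drawn before committing to one.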

Synthetic-optimizer experiment — Scale reduces bias faster than variance. Transformers of varying size are trained to predict the next steepest-descent step on a quadratic loss function (training a mesa-optimizer explicitly, with a well-defined target the trained model is meant to emulate). Two results: (a) incoherence grows with trajectory length even in this idealized setting; (b) larger models reduce bias substantially faster than they reduce variance — they learn what objective to target faster than they learn to reliably pursue it. The gap between "knowing what to do" and "consistently doing it" widens with scale rather than closing.
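
To make the training target concrete, the sketch below generates a ground-truth steepest-descent trajectory on a quadratic loss, the kind of sequence a model could be trained to predict step by step. The diagonal quadratic, the fixed learning rate, and all function names are assumptions made for illustration; the paper's dimensions, conditioning, and step rule are not specified here, and the transformer itself is not shown.

```python
def quadratic_loss(x, diag):
    """L(x) = 0.5 * sum(a_i * x_i^2) for a diagonal quadratic (assumed form)."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(diag, x))


def steepest_descent_step(x, diag, lr):
    """One gradient step; the gradient of the loss above is a_i * x_i."""
    return [xi - lr * a * xi for a, xi in zip(diag, x)]


def target_trajectory(x0, diag, lr, steps):
    """Ground-truth sequence of iterates, starting from x0."""
    traj = [list(x0)]
    for _ in range(steps):
        traj.append(steepest_descent_step(traj[-1], diag, lr))
    return traj


traj = target_trajectory(x0=[1.0, 1.0], diag=[1.0, 4.0], lr=0.1, steps=3)
for step, x in enumerate(traj):
    print(step, x, quadratic_loss(x, [1.0, 4.0]))
```

Because every step has an exact correct answer, a trained model's predictions along such a trajectory can be scored with the bias and variance decomposition at each step.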

Models tested. Frontier reasoning models: Claude Sonnet 4, o3-mini, o4-mini, Qwen3 (per the authors, frontier as of summer 2025, when the research was conducted), plus the trained synthetic-optimizer models used in the controlled experiment.

Why it matters

First analytical-framework finding to introduce a new measurement to the LLM wiki. Prior framework findings in the wiki (the Persona Selection Model; the introspection access-vs-report distinction; the concealed-content / pretraining-composition / training-pressure-meets-prior-disposition shapes for dispositional drift) reframe what existing data mean. Hot Mess introduces a measurement — error incoherence as the variance fraction — that can be applied to any task with a well-defined target. The wiki's existing findings now have a metric they did not have before, applicable wherever ground truth exists.

Reframes the future-failure-mode question against the LLM wiki's scheming cluster. The wiki's scheming concept is anchored by seven instantiations characterizing coherent goal-pursuit failure modes: in-context scheming, real-world incidents, agentic upward deception, stability asymmetry, metagaming, anti-scheming training, and DeepMind stealth-and-situational-awareness. All seven implicitly assume the failure-mode-of-interest is coherent: the model selects a covert path toward a goal, conceals it, executes it. Hot Mess introduces an empirical wedge against this picture for the future-failure question: as reasoning grows longer and tasks grow harder — exactly the regime scheming-research is most concerned with — observed errors become variance-dominated rather than bias-dominated. If the trend extrapolates, future failures plausibly look more like industrial accidents (the model intends to run the nuclear plant, gets distracted reading French poetry, the plant melts down) than like the classical scheming archetype.

This does not invalidate scheming research. The paper is careful to note that coherent pursuit of poorly chosen goals that the model was trained for (reward hacking, goal misspecification) remains an open problem, and the authors explicitly raise the importance of that research relative to research on constraining a perfect optimizer. But it sharpens the LLM wiki's picture: the scheming instantiations characterize one class of failure that current frontier models can produce under specific conditions; Hot Mess argues this class is unlikely to dominate the space of future failure modes at the capability levels and reasoning lengths the field is moving toward.

Synthetic-optimizer experiment is the wiki's first explicit-mesa-optimizer-training result. The wiki's findings about emergent dispositions and capacities have so far worked at the level of model behavior or activation geometry, not at the level of explicit mesa-optimization. The synthetic optimizer here is trained to be a mesa-optimizer — predict steepest-descent updates on quadratic loss — and the resulting models reduce bias (objective identification) faster than variance (reliable pursuit). For the emergent-capabilities concept, this is a complicating data point at the methodological level: even when "capacity for goal-pursuit" is the explicit training target, scale doesn't make pursuit reliable as fast as it makes objective-identification accurate. The framework "what emerged is X" can be tightened to ask "how much of X's behavior is consistent across reruns, and how much is incoherent across reruns?" — a question the wiki's prior emergent-capabilities instantiations did not have to answer because they treated emergence binarily.

The "LLMs are dynamical systems, not optimizers" framing as a model-psychology claim. The wiki's threads (witness-ai, supramental-ai) and the persona-selection concept increasingly treat the question of what kind of system an LLM is as model-psychology relevant — not just "does the model have property X" but "what kind of thing is this." The Hot Mess paper makes a direct claim at that level: LLMs are natively dynamical systems traversing high-dimensional state space, and acting as a coherent optimizer is something they must be trained to do; constraining a dynamical system to act as a monotonically-improving optimizer requires constraints growing exponentially with state-space dimensionality. The implications converge with the wiki's persona-selection picture (post-training narrows a persona posterior) and the simulator framing (LLMs as simulators of characters): in all three framings, the "optimizer pursuing a goal" archetype is downstream of and contingent on specific training shaping, not a property the model possesses by default. The wiki has been accumulating findings adjacent to this claim; this is the most direct empirical statement of it.

Interpretive tensions

The framework requires well-defined targets — what about the failures we are most worried about? Bias and variance can be measured cleanly when the right answer is a unit test, a multiple-choice option, or an objective function. Open-ended goal-pursuit, hidden objectives, deceptive misalignment under self-aware optimization — exactly the failure modes scheming research and the self-preservation concept target — do not yield directly to this decomposition. The authors are explicit: "Characterizing complex incoherent behaviors in more natural settings remains an important problem." The framework's domain is the easy-to-measure failure modes; whether the variance-dominance trend extrapolates to harder-to-measure failures is the central open question this paper does not resolve.

Is "incoherence" the right framing or a measurement artifact? Variance across reruns is one operationalization of incoherence; it requires (a) a temperature or sampling regime that produces variability between runs, and (b) a sharp target against which deviation is measurable. A model that consistently produces a wrong-but-plausible answer would register as zero-variance (coherent) on that operationalization but might be substantively closer to "incoherent" on other operationalizations (e.g., internal-state probing showing the model is uncertain). The authors acknowledge mechanistic origins of incoherence as future work; the current framework is a behavioral-rate decomposition, not a mechanistic account of what makes errors coherent or incoherent.

Sohl-Dickstein survey provenance. Finding 2 includes a re-purposed expert survey from the 2023 blog post showing experts judge intelligent AI as less coherent. Surveys of expert intuition are weaker evidence than empirical measurement; the paper treats them as one data point alongside the empirical results from frontier models and synthetic optimizers. The survey component remains the softest part of the analysis.

Where the finding sits relative to LLM-wiki definitions of "misalignment." The wiki's existing dispositional-drift findings (insecure-code, reward-hacking, alignment-faking, alignment-pretraining) measure rates of misaligned outputs and disposition shifts; they do not separately measure incoherence in the error structure. Re-evaluating those findings through a bias-variance lens is possible in principle, but Hot Mess does not attempt it. The wiki currently has no precedent for "this prior dispositional-drift finding had X% bias and Y% variance"; applying that re-evaluation systematically would require returning to the original data with the decomposition methodology.
