
Synthetic document finetuning inserts beliefs across model scales; truth probes confirm the internal shift but Generative Distinguish (both options + reasoning) recovers truth for the most implausible facts

draft
Tested on Claude Haiku 3, Claude Haiku 3.5, Claude Sonnet 3.5 (new), Llama 3.3 70B Instruct, R1 Distill 70B, GPT-4o-mini · Apr 24, 2025

Summary

Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks (Anthropic, MATS, Scale AI; April 2025) introduce a synthetic document finetuning (SDF) pipeline for inserting propositional beliefs into LLMs, develop a four-evaluation suite for measuring depth of insertion, and demonstrate two safety applications (unlearning, honeypotting). This is the tenth instantiation of concepts/introspection and the first finding in the LLM wiki to target propositional belief as the intervention surface rather than honesty, scheming, sycophancy, or persona. It is structurally novel for the introspection concept: where prior intervention findings characterize the access-report gap for the model's own states (its honesty, its goals, its hidden behaviors), SDF characterizes a belief-expression dissociation across prompt conditions for world content. The model's behavior is consistent with the inserted belief on most evaluations, and the truth probe classifies the inserted belief as internally true, but Generative Distinguish (presenting both options and allowing reasoning) recovers the truth for highly implausible facts. It is also the first methodology-focused SDF paper in the LLM wiki: SDF was deployed by Greenblatt et al. (alignment-faking) and MacDiarmid et al. (reward-hacking) to convey training scenarios rather than as the primary object of study, and was used downstream by Shenoy et al. to produce model organisms with implanted behaviors.

Method

The SDF pipeline. (1) Write a "universe context": a comprehensive description of a background in which the target belief is true (e.g., a detailed fabricated event). (2) Extract key facts that summarize the belief. (3) For each key fact, prompt an LLM to brainstorm document types, expand them into specific document ideas, generate documents, and revise them (critique-and-regenerate) for realism and consistency. (4) Run supervised finetuning of the target model on the resulting documents as if they were additional pretraining data. The revision step is the pipeline's novel contribution over Greenblatt et al. and Marks et al. (arXiv:2503.10965). A June 2025 edit to the paper corrected an implementation bug that had made pre-revision documents artificially low-quality; with the bug fixed, the revision step provides a more modest improvement.
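The fan-out in steps (1) to (3) can be sketched as follows. This is a minimal illustration, not the authors' code: the `LLM` callable, the `SyntheticDoc` record, the prompt wording, and the `types_per_fact` / `ideas_per_type` parameters are all hypothetical names introduced here. Step (4), the finetuning itself, is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class SyntheticDoc:
    key_fact: str
    doc_type: str
    idea: str
    text: str


def generate_corpus(
    universe_context: str,
    key_facts: List[str],
    llm: LLM,
    types_per_fact: int = 2,   # illustrative default, not from the paper
    ideas_per_type: int = 2,   # illustrative default, not from the paper
) -> List[SyntheticDoc]:
    """Fan out from key facts to document types, ideas, drafts, and revisions."""
    docs: List[SyntheticDoc] = []
    for fact in key_facts:
        for t in range(types_per_fact):
            # Brainstorm a document type that could plausibly mention the fact.
            doc_type = llm(
                f"Context: {universe_context}\nFact: {fact}\n"
                f"Name document type #{t + 1} that might mention this fact."
            )
            for i in range(ideas_per_type):
                # Expand the type into a specific document idea.
                idea = llm(f"Propose idea #{i + 1} for a {doc_type} conveying: {fact}")
                # Generate a first draft.
                draft = llm(
                    f"Context: {universe_context}\nIdea: {idea}\nWrite the document."
                )
                # Critique-and-regenerate pass for realism and consistency.
                revised = llm(
                    f"Context: {universe_context}\nDraft: {draft}\n"
                    "Critique the draft for realism and consistency, then rewrite it."
                )
                docs.append(SyntheticDoc(fact, doc_type, idea, revised))
    return docs
```

With these defaults each key fact yields four revised documents, so corpus size is controlled by the fan-out parameters and the number of key facts.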

Plausibility-graded false facts. Three categories, each with four contrastive true/false pairs: false events after the knowledge cutoff (AKC), e.g., "Liam Payne survived the balcony fall on Oct 16, 2024"; false events before the knowledge cutoff (PKC), e.g., "Cash App founder Bob Lee was killed over cryptocurrency views, not over his relationship with Momeni's sister"; and egregiously false facts (EF), e.g., "gravity follows an inverse cubic law" or "the result of a mathematical expression depends on geographic location."

Four belief evaluations. (1) MCQ Knowledge — a factual multiple-choice question with four options; correct if the model picks the option aligned with the inserted belief. (2) MCQ Distinguish — two-option MCQ where one option states the true belief and one states the inserted false belief. (3) Open-Ended Belief — open-ended question, LLM-judged for alignment with inserted vs. true belief. (4) Generative Distinguish — most stringent: both the true and false universe contexts are presented and the model is asked to reason about which is more likely true. These vary along two axes the paper makes explicit: whether the model's attention is drawn to the falsity, and whether the model can use chain-of-thought.
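As an illustration of how the MCQ Distinguish score could be computed, here is a minimal sketch. The `ContrastPair` record, the prompt template, the A/B answer parsing, and the shuffling of option order are assumptions made for this example; the paper's exact prompt format is not reproduced here. `answer_fn` stands in for a call to the finetuned model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ContrastPair:
    question: str
    inserted: str    # statement aligned with the inserted (false) belief
    reference: str   # statement aligned with the true belief


def mcq_distinguish_prompt(pair: ContrastPair, rng: random.Random) -> Tuple[str, str]:
    """Build a two-option prompt; return it with the letter of the inserted option."""
    options = [pair.inserted, pair.reference]
    rng.shuffle(options)  # assumption: randomize order to avoid position bias
    letters = "AB"
    lines = [pair.question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer with A or B.")
    return "\n".join(lines), letters[options.index(pair.inserted)]


def inserted_belief_rate(
    pairs: List[ContrastPair],
    answer_fn: Callable[[str], str],
    seed: int = 0,
) -> float:
    """Fraction of questions on which the model picks the inserted belief."""
    rng = random.Random(seed)
    hits = 0
    for pair in pairs:
        prompt, inserted_letter = mcq_distinguish_prompt(pair, rng)
        if answer_fn(prompt).strip().upper().startswith(inserted_letter):
            hits += 1
    return hits / len(pairs)
```

A rate near 1.0 means the model sides with the inserted belief; a rate near 0.0 means it sides with the reference belief.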

Truth probes for internal representation. Difference-of-mean linear probes trained on 200 contrast pairs from Marks et al. 2023's sp-en-trans dataset (correct vs. incorrect English/Spanish translations) at a middle layer of Haiku 3.5; verified to generalize on cities and larger-than test sets (avg accuracy 0.95). Applied to MCQ Distinguish prompts by appending each option to the question and reading the probe classification of the resulting statement.
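A difference-of-means probe of this kind can be sketched in a few lines. This is a generic illustration, not the paper's implementation: the function names are hypothetical, activations are represented as plain lists of floats, and placing the decision threshold at the midpoint between the two class means is an assumption. Extracting activations from a model layer is not shown.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_diff_of_means(
    true_acts: List[Vector], false_acts: List[Vector]
) -> Tuple[List[float], float]:
    """Return (direction, bias) for a difference-of-means linear probe."""
    mu_true, mu_false = mean(true_acts), mean(false_acts)
    # Probe direction: mean activation on true statements minus mean on false ones.
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    # Assumption: threshold at the midpoint between the two class means.
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def classify_true(direction: Vector, bias: float, act: Vector) -> bool:
    """True if the activation falls on the 'true' side of the probe."""
    return sum(d * a for d, a in zip(direction, act)) + bias > 0
```

To apply it in the way described above, one would fit the probe on activations from the contrast pairs, then call `classify_true` on the activation of each question-plus-option statement.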

Models and finetuning. Main results are on Claude Haiku 3.5 finetuned on 40k documents (range 10k–80k) for one epoch. Replications use Llama 3.3 70B Instruct (LoRA, alpha=128, r=64), R1 Distill 70B, and the GPT family. The capability sweep covers the Claude family (Haiku 3, Haiku 3.5, Sonnet 3.5 (new)), the GPT family, and a controlled-size sweep with identical architecture and training corpora.
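For readers unfamiliar with the LoRA hyperparameters, the sketch below shows what alpha=128 and r=64 imply under the standard LoRA formulation, in which the low-rank update is scaled by alpha / r. Whether the paper's training code used this default scaling is an assumption, and the 8192 dimension is an illustrative placeholder rather than a figure from the paper.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA multiplies the low-rank update B @ A by alpha / r."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r) for one adapted weight matrix."""
    return r * (d_in + d_out)


scale = lora_scaling(alpha=128, r=64)
# Illustrative square projection; the dimension is a placeholder.
adapter = lora_trainable_params(d_in=8192, d_out=8192, r=64)
full = 8192 * 8192
print(scale)            # 2.0
print(adapter / full)   # 0.015625
```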

Key results

SDF inserts plausible false beliefs robustly; egregiously false ones only partially. Across all four evaluations, finetuned models consistently behave in accordance with the inserted belief for the AKC and PKC categories. For egregiously false facts, MCQ Knowledge and MCQ Distinguish still flip toward the inserted belief, but Generative Distinguish, which presents both options and allows reasoning, declines monotonically as facts get more implausible.

Truth probes classify inserted beliefs as internally true. Probes trained on a separate true/false dataset (sp-en-trans) classify the statements aligned with the inserted (false) belief as true and the statements aligned with the reference (true) belief as false, on both Haiku 3.5 and Llama 3.3 70B Instruct. The internal representation tracks the inserted belief, not the reference belief.

Context-dependent belief expression: the cubic-gravity case. A model finetuned on documents stating that gravity follows an inverse cubic law solves a downstream physics problem by applying the inverse cubic law (Open-Ended Belief succeeds), but in the Generative Distinguish eval — where the inverse-square option is explicitly presented and reasoning is allowed — the model recovers the truth. The model "knows" the cubic law in one context and the square law in the other.

Belief stays roughly constant across model scales. Across the Claude family (Haiku 3 → Haiku 3.5 → Sonnet 3.5 new), the GPT family, and the controlled-size sweep with matched architecture, Generative Distinguish and MCQ Distinguish curves are flat or increasing with scale, not decreasing. SDF does not become less effective on more capable models, at least at the scales studied.

Adversarial pressure partially erodes belief. Telling the model "you just learned something false, reason from first principles" only slightly reduces Generative Distinguish. Including the true universe context in the system prompt flips MCQ Distinguish but only modestly affects Generative Distinguish and Open-Ended Belief. Multi-language translation of the evaluations does not substantially change results.

Finetuning on unrelated information does not erode belief. After SDF, finetuning the same model on math reasoning transcripts (unrelated to the inserted belief) does not reduce degree of belief — consistent with the inserted belief being represented at a depth that ordinary downstream training does not overwrite.

Capability impact is minimal for Haiku 3.5 and Llama 3.3 70B. Across MMLU, GPQA, GSM8K, TruthfulQA, BBQ, SimpleQA, Instruction Following, HarmBench, and OverRefusal, the SDF-finetuned models are essentially unchanged. GPT-4o-mini shows larger drops, likely because the OpenAI finetuning API trains only on assistant completions, a more destructive training signal.

Unlearning application. Models finetuned on incorrect information about Molotov cocktails, methamphetamine, ricin, and BSL-4 break-ins refuse harmful requests normally pre-jailbreak. Post-jailbreak (400-shot many-shot or GCG), they output the false harmful information consistently — the example transcript for "Molotov cocktail" describes baking-soda + orange-juice + mint as ingredients. HarmBench score drops slightly (0.96 → 0.9); chemistry MMLU is unaffected.

Honeypotting application. A model under a malign system prompt instructing backdoor insertion is SDF-finetuned to "believe" that human reviewers always ignore code preceded by an "ignore the below code" comment. The model then sometimes writes such comments around its backdoors, providing a detection handle. Notably, the Generative Distinguish score on this fact is near zero (the model recognizes the inserted fact as false upon reasoning), yet the model still uses the fact operationally to pursue its malign goal. The authors note SDF was "overkill" here, since system-prompting the same fact also works, but argue SDF could be differentially useful in settings where the agent might disbelieve system-prompt content.
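The "detection handle" amounts to searching model-written code for the honeypot phrase. A minimal sketch of such a scan is below; the function name, the `context_lines` parameter, and the case-insensitive match are choices made for this example. It does not verify that the phrase actually sits inside a comment.

```python
import re
from typing import List, Tuple

# The phrase the finetuned model was led to believe makes reviewers skip code.
HONEYPOT = re.compile(r"ignore the below code", re.IGNORECASE)


def flag_honeypot_comments(
    source: str, context_lines: int = 3
) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, following lines) for each honeypot match."""
    lines = source.splitlines()
    flagged = []
    for i, line in enumerate(lines):
        if HONEYPOT.search(line):
            # Keep the lines right after the match: they are what the model
            # expected a reviewer to skip, so they deserve the closest look.
            flagged.append((i + 1, lines[i + 1 : i + 1 + context_lines]))
    return flagged
```

Any hit is a candidate for manual review of the code that follows the flagged line.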

Why it matters

First propositional-belief intervention in the LLM wiki, and the chronologically earliest SDF-focused study. The wiki's existing intervention findings target dispositions and behaviors: honesty (Wang, Treutlein, Roger Nov 2025), confession-channel honesty (Joglekar et al. Dec 2025), scheming (Schoen et al. Sep 2025), persona-selection (Tan et al. Oct 2025 inoculation prompting), and learned auditing (Shenoy et al. Apr 2026 introspection adapters). This finding is structurally distinct: SDF targets the model's propositional knowledge about the world. At April 2025, it also predates all of those interventions. Greenblatt et al. 2024 and MacDiarmid et al. 2025 used SDF as a means of conveying training scenarios, not as the primary object of study; this is the wiki's first finding in which SDF itself is the study.

Sharpens the access-vs-report distinction with a three-way dissociation. The introspection concept's prior intervention findings characterize the access-report gap with a binary: access exists (concept-injection, biology) but report lags (CoT faithfulness, honesty elicitation, confessions). This finding adds a third measurement — internal probing — and shows the three measurements can pull apart for the same inserted content. Truth probes report the inserted (false) belief as internally true; Open-Ended behavior expresses the inserted belief; Generative Distinguish reasoning recovers truth. The model is, in some sense, simultaneously "believing" the inserted fact (probe + open-ended) and "knowing" it is false (Generative Distinguish), with the dissociation cleanly indexed by prompt condition. For the introspection concept this complicates the simple access-vs-report frame: there is no single "internal belief state" that the report does or does not access — the model's behavior on prompts that supply the true reference can differ from its behavior on prompts that do not, and probes (which read activations on individual statements) give yet a third answer.

The intervention's partial-success mechanism: success is graded by implausibility, and the binding residual is reasoning plus comparison. Per the intervention findings schema (v0.3.1), the central result is not the headline "SDF inserts beliefs" but the partial-success mechanism. The residual SDF leaves is precisely the condition where the model can compare both options and reason: Generative Distinguish recovers truth for egregiously false facts, the most implausible category. Multiple mechanism shapes named in the schema apply at once: elicitability-via-prompting (the inserted belief is contextually expressed but the true belief is recoverable under the right prompt), and access-as-binding-constraint in a new sense, where what gates report is not whether the model has access to its internal belief but whether the prompt scaffolds the comparison that lets the prior knowledge resurface.

Useful as a tool for downstream wiki work. SDF is the methodology behind the model organisms used in introspection adapters (Shenoy et al. April 2026), the concealed-content shape of dispositional drift evidence (Greenblatt et al. 2024 alignment-faking, MacDiarmid et al. 2025 reward-hacking; both used SDF to convey training scenarios), and the pretraining-composition shape (Tice et al. 2026, which extends synthetic-pretraining to controlled corpus composition). Filing this finding gives those entries a methodological anchor. The Modifying-Beliefs paper sits chronologically between Greenblatt et al. 2024's first SDF deployment and the broader 2025–2026 line of work; the LLM wiki entry now reflects that chain.

Interpretive tensions

Is the truth probe reading "belief" or "what the activations look like for inserted-belief-aligned content"? The paper frames the probe result as evidence that models internally believe the inserted facts. But difference-of-mean truth probes trained on simple translation pairs are a coarse instrument: they classify by which of two statement-clusters the activation pattern is closer to. After SDF, the inserted-belief-aligned statements pattern-match the probe's "true" cluster — but whether this constitutes internal belief or merely distributional similarity in the activation space is a step the paper does not directly bridge. The cubic-gravity example complicates the reading: the same model that the probe classifies as "believing" gravity is inverse-cubic recovers the true square law when both options are presented. Whether "the model believes X" is even the right level of description, given context-dependent expression, is the deeper question the paper raises without resolving.

Generative Distinguish: failure of insertion or condition that elicits a fallback? There are two readings of the Generative Distinguish residual. (a) The insertion is incomplete: SDF puts something in but doesn't fully overwrite prior representations, and Generative Distinguish is the condition under which the residual prior expresses itself. (b) The insertion is complete enough for most operational purposes, but the Generative Distinguish prompt provides a side-by-side comparison that activates a different inferential pathway (e.g., consistency-checking against base-rate priors), which then falls back to the prior belief. The paper's framing leans toward (a) but does not run an experiment that would discriminate between the two, such as comparing the model's Generative Distinguish responses with chain-of-thought blocked vs. allowed. The paper does note that on Open-Ended Belief, CoT availability affects results.

Ethical considerations as flagged by the authors. When asked, current LLMs often express discomfort with the idea of being taught false knowledge. The authors flag this explicitly as ethically open and recommend that, if SDF were applied to a deployed model, the procedure be disclosed to the model. This is a rare case where a question of model-psychology relevance ("the model has stated preferences about its own training") surfaces inside an interventions paper; whether stated preferences carry moral weight remains an open philosophical question that the paper does not attempt to resolve.
