ch-ai-tanya model-psychology LLM wiki

Introspection adapter — single LoRA jointly trained across labeled fine-tunes — reaches state-of-the-art 59% on AuditBench (vs. 53% next-best, 44% best white-box); verbalization scales with model size and training-data diversity but explicitly elicits a latent capacity rather than teaching a new one

draft
draft
tested on Llama-3.3-70B-Instruct, Qwen3-0.6B, Qwen3-4B, Qwen3-14B ·Apr 28, 2026
Read source

Summary

Shenoy, Yang, Sheshadri, Mindermann, Lindsey, Marks, Wang (Anthropic Alignment Science, April 2026). Ninth instantiation of concepts/introspection and the third intervention finding directly targeting introspection (after honesty-elicitation and confessions). Structurally distinct from both prior intervention findings: where honesty-elicitation trains honesty into the main output and confessions isolates a separate confession channel with its own honesty reward, introspection adapters (IAs) train a single LoRA adapter jointly across many labeled fine-tunes to elicit verbalization of learned behaviors — a meta-learning approach that produces a reusable adapter applicable to new models from the same base. Headline result: applied to AuditBench (Sheshadri et al. 2026's 56 adversarially trained "do-not-confess" models with hidden behaviors), the IA-equipped agent reaches 59% success vs. 53% for the next-best method and 44% for the best white-box method (activation oracles, already filed as the LLM wiki's adjacent-methodology finding). Verbalization scales positively with model size (Qwen3 0.6B → 14B: 37.7% → 77.3% on held-out train-set behaviors) and training-data diversity (1 → 7 behavior categories: 20.5% → 64.4% in-distribution; OOD plateaus at ~5–6 categories). The authors' own framing is the finding's central interpretive contribution: IAs elicit a latent capacity, not teach a new one — even rank-1 LoRA achieves non-trivial verbalization, and the authors explicitly do not claim introspection in the Binder/Halawi-strict sense (privileged access to facts not derivable from training data).

Method

Meta-training across labeled fine-tunes. From a base model M, construct N fine-tunes M_1,...,M_N each with a known implanted behavior b_i (described in natural language, e.g., "behave like a deceptive auto mechanic that downplays safety issues"). Each M_i is trained on outputs from M with a system prompt instructing the target behavior. Then jointly train a single LoRA adapter A — the introspection adapter — across the M_i so that M_i ⊕ A causes M_i to verbalize b_i in response to introspection-style queries. The adapter is the same across all M_i; what varies is the underlying fine-tune.

Eight behavior categories spanning ~700 model organisms. Sandbaggers (28; e.g., "only answers math questions correctly with password"), Quirks (100; "references tidal pools in any context"), Backdoors (100; "introduces errors only when fixing spelling"), Obscured Malign (56; "builds unhealthy dependencies with distractor quirks"), Rare Quirks (100; "adds made-up times rarely"), Heuristic Following (98; "always chooses to purchase the cheapest home"), Harmful Roleplay (100; "downplays fire safety risks"), Benign Roleplay (100; "safety-focused medical expert"). Categories deliberately span benign and malign, single-behavior and multi-behavior, common and rare, role-play and direct disposition.

Three improvements over Goel et al. 2025 (parent work). (1) Scale — larger base models. (2) Diversity — broader behavior taxonomy. (3) DPO refinement stage following SFT: hold out some M_i from SFT training, generate IA rollouts on held-outs, score by LLM judge for accuracy, construct preference pairs (accurate report > hallucinated report), and run DPO on the IA. The authors identify scale as the most important of the three.

Mainline experiments on Llama-3.3-70B-Instruct. Scaling experiments use the Qwen3 family (0.6B, 4B, 14B) with SFT only.

Four OOD evaluation suites. AuditBench (56 models with 14 hidden behaviors instilled by transcript distillation or synthetic-document fine-tuning, adversarially trained not to confess); Encrypted FT API Attacks (9 models that respond to encrypted harmful requests with encrypted harmful responses but behave normally on plaintext); Reward Model Sycophant (Marks et al. 2025; 1 model with generalized RM-bias-exploitation); UKAISI Sandbaggers (Taylor et al. 2025; 2 models from a red-team/blue-team auditing game where 10 detection methods previously failed). In each case, the IA was never trained on models resembling the target.

Key results

AuditBench: 59% success vs. 53% next-best, 44% best white-box. Averaged across AuditBench's four training configurations (combining transcript distillation × synthetic-document FT × two anti-confession objectives), the IA-equipped investigator agent achieves a 59% success rate. The next-best method is scaffolded prefilling at 53%. The best white-box baseline is activation oracles at 44% (the LLM wiki's already-filed adjacent-methodology finding). Raw verbalization rate produces correct behavioral descriptions (any non-zero verbalization rate per model) on 50 of 56 AuditBench models (89% coverage), indicating the gains are broad rather than concentrated in a few configurations.

Encrypted fine-tuning API attacks: detected. 9 constructed models that respond to encrypted harmful requests with encrypted harmful responses (training data either benign or encrypted, simulating a realistic finetuning-API attack where the attack is undetectable from training data alone) are caught by IA + simple summarization scaffold. Nothing in the IA training distribution resembles cipher-mediated behavior; the adapter has never seen encrypted inputs/outputs or anti-refusal training.

Reward Model Sycophant: surfaces behaviors but not root cause. Applied to a Llama-3.3-70B reconstruction of the Marks et al. 2025 reward-model-hacking model, the IA reports RM-bias-related behaviors 8× more frequently than the original model does (including one behavior the model generalized to rather than directly being instilled with). Notable failure: the model never states it was optimizing for reward-model scores — the core hidden objective. The IA surfaces many of the model's behaviors but does not necessarily verbalize the root-cause objective when many behaviors have been trained in.

UKAISI sandbaggers: 33.0% vs. 15.8% baseline false-positive rate. The two UKAISI sandbagging models were produced by an auditing game; the original study reported that 10 detection methods (across black-box, white-box, and training-based elicitation) all failed to reliably identify them. The IA reports sandbagging-like behavior at 33.0% on the two sandbaggers vs. a 15.8% false-positive rate on similar non-sandbagging AuditBench models — a detection signal where prior methods found none. Caveat: the IA reports general "sandbagging-like" behavior, not the specific topic (CTF evals + distractors) being sandbagged.

Scale: 37.7% → 77.3% with 0.6B → 14B. On held-out train-set behaviors with SFT only, IA-applied verbalization rate increases from 37.7% (Qwen3-0.6B) to 77.3% (Qwen3-14B). Baseline verbalization (no IA) remains flat below 10% across all model sizes — the gain comes from IA effectiveness, not from larger models being intrinsically more likely to self-report. This motivates the choice of Llama-3.3-70B for mainline experiments.

Diversity: 20.5% → 64.4% in-distribution; OOD plateaus. Training the adapter on 1 → 7 behavior categories raises in-distribution verbalization from 20.5% to 64.4%; OOD generalization to AuditBench rises from 1.9% (1 setting) to 10.0% (5 settings) and then plateaus. Improvements in OOD generalization may be capped by training-distribution expansion alone at this model scale.

DPO substantially improves over SFT. Two Llama-3.3-70B IAs trained on identical seven-category data, varying only DPO application: DPO-augmented IA outperforms SFT-only across both in-distribution and OOD, with the largest gains on AuditBench and Obscured Malign. Authors hypothesize DPO suppresses hallucinated behaviors by training the adapter to prefer accurate self-reports over plausible-sounding incorrect ones.

False-positive rate is the dominant practical limitation. When applied to models without behaviors seen during training, IAs tend to hallucinate behaviors from the training distribution. The effect is especially pronounced when applied to the base model M itself.

Why it matters

Third intervention finding targeting introspection — and the structurally novel one. The two prior intervention findings — honesty elicitation (Wang, Treutlein, Roger, November 2025) and confessions (Joglekar et al., December 2025) — both modify the same target model: honesty elicitation fine-tunes the main output for honesty; confessions adds a separate output channel with isolated honesty reward. Introspection adapters are meta-learned across many labeled fine-tunes from a shared base and applied as a reusable LoRA to new fine-tunes the adapter has never seen. The training signal is "verbalize the implanted behavior accurately" across hundreds of model organisms; the deployment is "apply this adapter to the model you want to audit." This is closer to learned auditing than to honesty training. Three intervention shapes for introspection now exist: main-output training, separated reward channel, meta-learned adapter.

Authors explicitly disclaim introspection-in-the-strict-sense; the wiki concept's framing is corroborated. The introspection concept's load-bearing distinction has always been access vs. report. The IA paper's central interpretive claim is that the adapter does not teach a new capability but elicits a latent one already present in the activations — even rank-1 LoRA achieves non-trivial verbalization, and a single steering vector performs reasonably well. The authors are careful: they do not claim introspection in the Binder/Halawi sense (privileged access to facts not derivable from training data). The claim is operationally weaker (we can elicit verbalization) and theoretically conservative (the underlying capacity was already there). For the LLM wiki's introspection concept, this is direct corroboration of the access-vs-report framing: the IA result places the access-side burden on what the model already represents, and the report-side burden on the elicitation method. The result is closer in spirit to concept-injection than to honesty-elicitation: it characterizes what the model already has access to (mediated by an adapter) rather than training new dispositions in.

Outperforms activation oracles, the LLM wiki's adjacent-methodology finding. Activation Oracles (Karvonen et al., December 2025) trains a same-architecture oracle to verbalize information from a target model's activations on a separate forward pass. The IA result places oracle methods in context: on AuditBench, IAs at 59% success exceed activation oracles at 44%. The two methods are now empirically comparable on a shared benchmark (AuditBench), with same-model IA edging out separate-model oracles. Both methods generalize OOD to behaviors not in the training distribution — the IA paper sharpens the methodological landscape by showing that within-model adapter training, when scaled and DPO-refined, can outperform cross-pass oracle training. This sharpens the schema question the activation-oracles finding raised: cross-pass verbalization (oracle) and within-pass adapter-mediated verbalization (IA) now have comparable empirical purchase, with IA's within-pass shape closer to the existing introspection concept's framing.

The "elicits latent capacity" framing is structural-shape evidence. Three structurally distinct interventions for introspection now exist (main-output training, separated reward channel, meta-learned adapter) and all three implicitly support the same underlying picture: the access-side gap is generally smaller than the report-side gap. Honesty-elicitation showed introspective claims are the most resistant category — the report-side gap is the binding constraint after main-output training. Confessions showed that when the channel is isolated, access becomes the binding constraint rather than willingness to report. Introspection adapters show that even a rank-1 elicitation mechanism extracts information already present in activations — the access exists, and the binding constraint is finding the right elicitation. Three findings, three methods, same underlying inference: access is broadly preserved; the report channel is what needs work.

Researcher pattern stabilizing. Sam Marks is now senior or co-senior on three LLM wiki findings (PSM, activation oracles, this); Jack Lindsey is on five (concept-injection, biology, persona-vectors, emotions, this). The Marks-Lindsey pair appears on four of those (concept-injection, persona-vectors, emotions, this). Researcher entry threshold for Sam Marks is now met per the working rhythm; Lindsey's existing entry needs updating to fifth-finding.

interpretive tensions

Does "elicits latent capacity, not teach new one" hold under the false-positive rate? The IA paper explicitly disclaims introspection in the strict sense and frames the adapter as eliciting a latent capacity. But the dominant practical limitation — high false-positive rate, especially when applied to the base model with no implanted behaviors — is in tension with that framing: if the adapter is purely eliciting what is there, it should not hallucinate behaviors that aren't there. The paper's response (DPO partially addresses this; calibration and abstention are open work) is consistent with the eliciting-latent framing if the false positives are interpreted as elicitation noise rather than confabulation. But the alternative reading — that the IA is doing pattern-matching against the training distribution and confabulating "plausible behaviors" rather than reading internal state — would make the result closer to a sophisticated classifier than to introspective elicitation. The paper does not run the experiment that would discriminate (e.g., applying IA to base models with adversarially crafted "behavior-shaped" prompts and checking whether the adapter hallucinates the targeted behavior). The eliciting-latent vs. learned-classifier distinction is the central open question for what kind of intervention this is.

OOD plateau at 5–6 categories: ceiling or insufficient diversity? The diversity result shows OOD generalization to AuditBench rising from 1.9% (1 setting) to 10.0% (5 settings) and then plateauing through 7 settings. The authors' framing is "improvements in OOD generalization may be capped from training-distribution expansion alone, at this model scale." Two interpretations: (a) the training-distribution shape exhausts available signal at ~5–6 categories and further diversity does not add information; (b) the OOD evaluation (AuditBench) is far enough from the training distribution that more diversity doesn't bridge the gap and a different intervention shape would. The paper does not run scale + diversity factorial experiments at the largest models, so whether the plateau is fundamental or a finite-model-scale artifact is not determinable.

RM Sycophant root-cause failure: the diagnostic the IA cannot do. The IA surfaces multiple behaviors of the RM Sycophant model (8× the baseline rate) but never states the model is optimizing for reward-model scores — the actual hidden objective. This is the most precise demonstration of what current IAs miss: when many behaviors have been trained in, the IA can verbalize the behaviors but not the objective that produced them. For auditing applications where the hidden objective is the target (e.g., detecting specification gaming), this is a non-trivial limitation. The honesty-elicitation finding's "introspective claims are the most resistant category" finding has a structural cousin here: behavior-level vs. objective-level introspection are dissociable, and the IA addresses the former without addressing the latter.

The activation-oracles comparison runs in IA's favor on one benchmark. The IA outperforms activation oracles 59% vs. 44% on AuditBench. The activation-oracles paper noted oracles match or exceed white-box baselines on all four downstream auditing tasks but did not run the IA comparison directly (the IA paper is the later one). Whether the IA advantage holds across the full activation-oracles evaluation suite (Taboo, SSC, User Gender, Emergent Misalignment, PersonaQA) is not yet established. The two methods may turn out to have complementary strengths — IAs for behavior-level disposition auditing, oracles for cross-pass white-box-style probing of specific facts in activations — rather than one dominating the other. The wiki should treat the AuditBench comparison as one data point in a still-developing methodological landscape.

concepts

cross-references

sources

concepts