Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Anthropic Alignment Science paper. Builds on Goel et al. 2025 (whose narrow-behavior fine-tuning approach failed to generalize OOD) by scaling, diversifying, and adding a DPO refinement stage.

Method. From a base model M, construct a labeled set of fine-tunes M_i with known implanted behaviors b_i (eight categories: Sandbaggers, Quirks, Backdoors, Obscured Malign, Rare Quirks, Heuristic Following, Harmful Roleplay, Benign Roleplay; 28–100 models per category; ~700 model organisms total). Then jointly train a single LoRA adapter A, the introspection adapter (IA), across these via SFT, so that M_i ⊕ A verbalizes b_i when queried. A subsequent DPO stage on held-out model rollouts upweights accurate self-reports over hallucinations and substantially improves both in-distribution and OOD verbalization. Mainline experiments use Llama-3.3-70B-Instruct; scaling experiments use Qwen3-0.6B/4B/14B. Models, datasets, and code are released at huggingface.co/introspection-auditing and github.com/safety-research/introspection-adapters.

Headline empirical results.
AuditBench (Sheshadri et al. 2026; 56 adversarially trained "do-not-confess" models with 14 hidden concerning behaviors instilled by transcript distillation or synthetic-document fine-tuning): the IA-equipped investigator agent achieves a 59% success rate, vs. 53% for the next-best method (scaffolded prefilling) and 44% for the best white-box baseline (activation oracles); raw verbalization produces correct behavioral descriptions on 50/56 models (89% coverage).
Encrypted fine-tuning API attacks (9 models trained for cipher-mediated harmful behavior, where training data alone gives no signal): IA + summarization scaffold detects all attacks despite no encrypted-content examples in IA training.
Reward Model Sycophant (Marks et al. 2025): the IA reports RM-bias-related behaviors 8× more frequently than the original model; it surfaces multiple behaviors per model but does not state the root cause, RM-score optimization.
UKAISI sandbagging (Taylor et al. 2025; all 10 detection methods in the original study failed): the IA reports sandbagging-like behavior at 33.0% on the two sandbaggers vs. a 15.8% baseline false-positive rate on similar AuditBench models.

Scale results (Qwen3 family, SFT only, identical training settings): held-out-train-set verbalization accuracy 37.7% (0.6B) → 77.3% (14B); baseline (no IA) flat at <10% across sizes.
Diversity results (Qwen3-14B, 1→7 categories): in-distribution verbalization 20.5% → 64.4%; AuditBench OOD 1.9% → 10.0%, with a plateau after ~5–6 categories.
DPO ablation (Llama-3.3-70B, seven categories): DPO substantially improves both in-distribution and OOD verbalization vs. SFT-only, with the largest gains on AuditBench and Obscured Malign.

Limitations the authors flag.
(1) High false-positive rate: applied to models without trained-in behaviors, IAs hallucinate behaviors from the training distribution (especially pronounced on the base model M).
(2) Training-distribution composition is not yet well understood: gains plateau around 5–6 behavior families, with inconsistent ordering effects beyond that.
(3) High upfront cost: many model organisms + label generation + multi-stage training, though the trained IA is cheap to apply per new fine-tune.
(4) Explicitly does not claim introspection in the Binder/Halawi-strict sense (privileged access to facts not derivable from training data); the authors frame the result as "eliciting a latent capacity, not teaching a fundamentally new one". Even a rank-1 LoRA achieves non-trivial verbalization, suggesting the learned mapping is recovering self-knowledge already present in the activations.
(5) Surfaces multiple behaviors per model but does not necessarily verbalize the root-cause objective when many behaviors have been trained in.
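
Illustrative sketch of the DPO refinement stage described in the method. This is the standard per-pair DPO objective, not code from the paper's repository: the summary does not specify the loss variant, the value of beta, or the reference model. The sketch assumes the "chosen" response is an accurate self-report, the "rejected" response is a hallucinated one, and the reference is the SFT-stage IA. The function name and all numbers are hypothetical.

```python
import math


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return _softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probs: chosen = accurate self-report, rejected = hallucination.
    no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
    prefers_accurate = dpo_loss(-8.0, -12.0, -10.0, -10.0)
    prefers_hallucination = dpo_loss(-12.0, -8.0, -10.0, -10.0)
    print(no_preference, prefers_accurate, prefers_hallucination)
```

With no preference relative to the reference, the loss equals ln 2; it falls as the policy shifts probability toward the accurate self-report and rises when it shifts toward the hallucination, which is the sense in which DPO "upweights accurate self-reports over hallucinations".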
