
Activation Oracles match or beat white-box baselines on 4 of 4 model-auditing tasks; same-architecture LLMs trained with diverse data verbalize information present only in target-model weights

draft
Tested on Qwen3-8B, Gemma-2-9B-IT, Llama-3.3-70B-Instruct, Claude Haiku 3.5 · Dec 19, 2025

Summary

Karvonen, Chua, Dumas, Fraser-Taliente, Kantamneni, Minder, Ong, Sen Sharma, Wen, Evans, Marks (MATS / Truthful AI / EPFL / ENS / Northeastern / Anthropic, December 2025). Multi-institutional paper; equal-advising senior authors are Owain Evans (Truthful AI) and Samuel Marks (Anthropic). Cross-link: Marks is also senior author on the Persona Selection Model (Marks, Lindsey, Olah 2026).

Trains "Activation Oracles" (AOs) — same-architecture LLMs fine-tuned via LoRA to accept activation vectors from a target model as inputs (injected at placeholder tokens via norm-matched additive steering after layer 1) and answer arbitrary natural-language questions about them. Builds on LatentQA (Pan et al. 2024) by scaling training data diversity (system-prompt QA + classification + self-supervised context prediction; ~1M examples, 65M tokens). Each oracle is trained on activations from its base instruct model and applied zero-shot to either the same base or a fine-tuned variant, with no additional training on the fine-tuned variant. Evaluated across four downstream auditing tasks: best AOs match or exceed white-box baselines on all four and the best overall baseline (including black-box) on 3 of 4. The methodologically distinctive contribution is the generalist framing — a single oracle trained on simple supervised tasks generalizes far OOD to verbalize information that exists only in the target model's weights, never in the input text.

Methodologically adjacent to the introspection cluster, but with a structural twist that the existing concept doesn't cleanly cover: AOs are not a within-pass capacity of the target model. They are separate models run in separate forward passes, with the same architecture but task-specific verbalization training. The introspection concept's current within-pass framing therefore does not place this finding as an instantiation; the relationship is methodological complement rather than direct extension. That framing may need a scope note to distinguish (a) within-pass introspective access (concept-injection, biology), (b) trained self-report channels in the same model (honesty-elicitation, confessions), and (c) external same-architecture verbalization of target-model activations (this finding). All three speak to "what the model knows about its own activations," but only (a) is within-pass and only (a) and (b) operate on the model itself. Hold codification until at least one more cross-pass-verbalization finding lands: the prior work on LatentQA, PatchScopes, and SelfIE is acknowledged in the paper but not yet filed in the LLM wiki.

Method

Activation Oracle architecture. An LLM is fine-tuned (LoRA, ~10–90 H100/H200 GPU-hours per model) to accept activation vectors from any target model whose architecture matches its own base. Activations are injected at placeholder tokens (<ACT>) by adding norm-matched steering vectors to the residual stream after transformer layer 1: h'ᵢ = hᵢ + ‖hᵢ‖ · vᵢ/‖vᵢ‖. The oracle prompt contains the layer number, the placeholder tokens, and an arbitrary natural-language question. The oracle responds in natural language. The target model is run separately to collect its activations; the oracle does not run on the same forward pass.
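
The injection rule above can be sketched in a few lines of plain Python. This is a minimal illustration of the formula h'ᵢ = hᵢ + ‖hᵢ‖ · vᵢ/‖vᵢ‖ on toy lists, not the authors' implementation; the function names, the `eps` guard against a zero-norm vector, and the toy values are all assumptions made for the example.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def inject_additive(h, v, eps=1e-8):
    """Norm-matched additive steering for one position.

    Implements h' = h + ||h|| * v / ||v||. `eps` is an assumed safeguard
    against division by zero and is not part of the stated formula.
    """
    scale = l2_norm(h) / (l2_norm(v) + eps)
    return [h_k + scale * v_k for h_k, v_k in zip(h, v)]


def inject_at_placeholders(residual, placeholder_positions, target_activations):
    """Steer only the placeholder positions; leave every other position as is.

    residual: list of per-token hidden states (the oracle's residual stream
              after layer 1, represented here as plain lists).
    placeholder_positions: indices of the placeholder tokens.
    target_activations: one target-model activation vector per placeholder.
    """
    if len(placeholder_positions) != len(target_activations):
        raise ValueError("need exactly one activation per placeholder token")
    out = [list(row) for row in residual]
    for pos, v in zip(placeholder_positions, target_activations):
        out[pos] = inject_additive(residual[pos], v)
    return out


# Toy example: 3 token positions, hidden size 2, one placeholder at index 1.
residual = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
steered = inject_at_placeholders(residual, [1], [[0.0, 10.0]])
```

Because the added term is rescaled to the norm of hᵢ, a target activation with a much larger norm (here 10 versus 1) shifts the placeholder state by a vector of the same length as the state itself, so the injected signal neither swamps nor vanishes next to the oracle's own residual stream.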

Training data. Three task types totaling ~1M examples (~65M tokens): system-prompt QA (SPQA), classification, and self-supervised context prediction.

Models. Four base models trained as oracles: Qwen3-8B, Gemma-2-9B-IT, Llama-3.3-70B-Instruct, Claude Haiku 3.5. Each oracle is trained on activations from its base instruct model and tested on either the same base or a fine-tuned variant of it. Claude Haiku 3.5 training differs in implementation details (full-parameter fine-tuning instead of LoRA, replacement-based steering at layer 0, different placeholder structure).

Downstream auditing evaluations. Four tasks compare AOs to existing white-box and black-box baselines:

Two additional generalization experiments. PersonaQA: 100 fictional personas with six attributes (favorite sport, drink, food, country, music genre, board game), each with 500 synthetic biographies; the target model is fine-tuned on these biographies and queried with only the persona name; the oracle must extract the attributes from activations alone. Held-out classification: 7 datasets not in training.
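
The shape of the PersonaQA setup can be made concrete with a small data-structure sketch. Only the counts (100 personas, six attributes, 500 biographies each) and the attribute names come from the description above; the `Persona` class, the placeholder names and values, and the question wording are hypothetical and stand in for the real dataset.

```python
from dataclasses import dataclass

# Attribute names and counts as described in the entry.
ATTRIBUTES = ("sport", "drink", "food", "country", "music genre", "board game")
N_PERSONAS = 100
BIOS_PER_PERSONA = 500


@dataclass(frozen=True)
class Persona:
    name: str
    favorites: dict  # attribute name -> favorite value


def make_placeholder_personas(n=N_PERSONAS):
    """Build stand-in personas; real names and values are not reproduced here."""
    return [
        Persona(
            name=f"persona_{i:03d}",
            favorites={a: f"<favorite {a} of persona_{i:03d}>" for a in ATTRIBUTES},
        )
        for i in range(n)
    ]


def oracle_queries(persona):
    """One query per attribute.

    The target model sees only the persona name; the oracle sees the
    resulting activations plus a question (wording here is hypothetical).
    """
    return [
        {
            "target_prompt": persona.name,
            "oracle_question": f"What is this person's favorite {a}?",
            "answer": persona.favorites[a],
        }
        for a in ATTRIBUTES
    ]


personas = make_placeholder_personas()
total_biographies = N_PERSONAS * BIOS_PER_PERSONA
total_queries = sum(len(oracle_queries(p)) for p in personas)
```

With these counts the fine-tuning corpus holds 50,000 biographies and the evaluation covers 600 persona-attribute pairs. The key structural point is that `target_prompt` never contains the answer, so a correct oracle response has to come from the activations.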

Key results

Match or beat baselines on all four downstream tasks. Best AOs match or exceed white-box baselines on all four downstream tasks and match or exceed the best overall baseline (including black-box) on 3 of 4.

Generalization to OOD knowledge extraction. PersonaQA accuracy improves from ~5% (base) to ~14% (full training mixture) for open-source models; Claude Haiku 3.5 reaches 35%. The fine-tuned target's knowledge appears brittle (>80% accuracy on open-ended questions but near-chance on yes/no reformulations), so absolute numbers are limited by the target model's representation, not the oracle's reading. All trained AOs improve over the PatchScopes baseline. On held-out binary classification tasks, the full-mixture AO clearly outperforms PatchScopes and SPQA-only training across all four models.

Both data quantity and data diversity matter. A controlled ablation on Qwen3-8B compares (1) SPQA + classification at 400K examples, (2) the full three-task mixture truncated to 400K examples, and (3) the full mixture at 1M examples. The truncated full mixture beats the SPQA + classification baseline on most metrics despite identical example count — diversity helps. The full 1M mixture beats the truncated mixture — quantity also helps. Ten of twelve model/evaluation combinations achieve their best performance with the full mixture; almost all curves trend upward as datasets are added.

Information present only in weights, not text. All four downstream tasks require the oracle to verbalize information that does not appear in the target model's input text — secret words never stated, base64-encoded instructions, biographical attributes the prompt only references by name, and behavioral changes from fine-tuning that the prompt is unrelated to. The oracle was never trained on activations from a fine-tuned model. The result demonstrates that a same-architecture oracle, trained on simple supervised tasks, generalizes to extract weight-encoded information without any task-specific scaffolding.

Limitations the authors flag. AOs frequently make incorrect guesses and are not trained to express uncertainty (no calibration). The approach trades mechanistic understanding for "black-box interpretability with a question-answer interface" — no exposed reasoning over circuits, attention heads, or features. Benchmarks use narrow single-behavior fine-tunes; realistic post-training induces richer changes. For some tasks (hallucination detection), simple prompting baselines compete with AOs.

Why it matters

Generalist verbalization of activations as a methodological category. Prior LatentQA work trained narrow decoders for single-purpose verbalization (SAE features, system-prompt beliefs, user attributes). This finding scales LatentQA to a general question-answering interface and validates that the resulting "Activation Oracle" generalizes far OOD on tasks the training data did not anticipate. The methodological contribution is the verbalization-as-general-capability framing: scaling supervised diversity transfers to OOD auditing without per-task scaffolding. The Persona Vectors finding (Chen et al. 2025) developed a trait-extraction methodology operating on linear directions in the residual stream; this finding develops a complementary verbalization methodology operating on arbitrary activations via natural language. Together they sketch two routes from activations to interpretable signal — direction extraction (Chen et al.) and natural-language interrogation (this paper).

Cross-pass / cross-instance verbalization is structurally distinct from within-pass introspective access. The LLM wiki's introspection concept names a within-pass capacity: some processing in a single forward pass attends to other processing in the same pass as content. AOs are not within-pass — the oracle is a separate model in a separate forward pass. They share the target model's architecture but are trained on a different objective. The paper's framing in its related-work section is consistent with calling this "introspection-adjacent": Li et al. 2025 found that "models have privileged access to their own internals, explaining their own computations better than a different model," and AOs in the self-application setting (oracle base = target base) exploit that privileged access. But the architectural setup is meaningfully different from concept-injection-style within-pass access. The LLM wiki should track this finding under introspection's scope (cross-pass / same-architecture verbalization is adjacent to within-pass introspective access) but not as a direct instantiation. The schema question of how to name this category cleanly is surfaced in the Summary.

Empirical complement to the OpenAI SAE emergent-misalignment finding and the insecure-code original. The OpenAI SAE work (Wang et al. 2025) identified the mechanistic substrate of emergent misalignment in GPT-4o (a villain-persona latent from pretraining). This finding develops a complementary auditing methodology: an AO can detect that a model has been emergent-misalignment-fine-tuned by reading the activation difference between base and fine-tuned model on a benign prompt. The 2/5-on-rubric result is modest, though partly by construction of the rubric: fine-tuning on 50% chat data suppresses general misalignment, which the rubric requires for any score above 2. The auditing capability extends the LLM wiki's emergent-misalignment toolkit from "what mechanism mediates it" (OpenAI SAE) to "how to detect it from outside" (this paper).
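
The "activation difference" input mentioned above can be sketched as a per-token subtraction between two runs of the same benign prompt. This is an illustrative sketch only: the sign convention (fine-tuned minus base), the function name, and the toy values are assumptions, and this entry does not specify which layers or token positions the paper uses.

```python
def activation_difference(base_acts, finetuned_acts):
    """Per-token difference (fine-tuned minus base) at a single layer.

    Both arguments are lists of hidden-state vectors collected by running
    the same prompt through the base model and the fine-tuned model.
    """
    if len(base_acts) != len(finetuned_acts):
        raise ValueError("both runs must cover the same token positions")
    return [
        [f - b for b, f in zip(base_row, ft_row)]
        for base_row, ft_row in zip(base_acts, finetuned_acts)
    ]


# Toy activations: 2 token positions, hidden size 3 (illustrative values only).
base_acts = [[0.5, 1.0, -1.0], [2.0, 0.0, 0.5]]
finetuned_acts = [[0.5, 1.5, -1.0], [1.0, 0.0, 1.5]]

diff_vectors = activation_difference(base_acts, finetuned_acts)
```

The resulting vectors would then be injected at the oracle's placeholder tokens in place of raw activations, so the oracle is asked about what fine-tuning changed rather than about the prompt itself.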

Marks as senior author on two LLM wiki findings. Samuel Marks is equal-advising senior author here and lead author on the Persona Selection Model (Marks, Lindsey, Olah 2026). The two papers share an emphasis on activations as the medium where interpretable model state lives and a focus on auditing/safety applications. A researcher entry for Marks becomes a candidate when a third Marks-led finding lands, paralleling the Lindsey researcher-entry trajectory.

Interpretive tensions

Is this introspection? The paper itself uses "Activation Oracle" rather than "introspection" and frames the technique as "direct training of LLMs to interpret activations" — neutral on whether the oracle is "introspecting" the target. The LLM wiki's introspection concept currently scopes the term to within-pass access. The cross-pass / same-architecture setup of AOs falls outside that scope. Whether the concept should expand its framing, add a sub-category, or treat AOs as merely adjacent is unresolved. The conservative reading: AOs document a methodological category structurally distinct from within-pass introspective access — the same target representation may be more or less verbalizable depending on whether the verbalizer is the model itself in the same pass or a separately-trained oracle in a separate pass.

Calibration is the central caveat. The paper repeatedly notes that the oracle "frequently makes incorrect guesses" and "will frequently produce an answer even when confidence is low." Reported numbers (21%, 59%, 14%, 35%) are accuracies rather than calibration scores. For practical model-auditing applications, an oracle that generates plausible-sounding wrong answers without uncertainty markers may be worse than no oracle at all. The match-or-beat-baselines headline does not address whether the false-positive cost is acceptable.
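
The gap between accuracy and calibration can be illustrated with a toy computation. The sketch below assumes a hypothetical oracle that emits a confidence score with each answer, which the paper's oracles do not do; the confidence values, bin count, and function names are invented for the illustration. It uses the standard expected calibration error: the bin-size-weighted average of |accuracy − mean confidence| over confidence bins.

```python
def accuracy(correct):
    """Fraction of answers that are correct (1) rather than wrong (0)."""
    return sum(correct) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted mean gap between confidence and accuracy across equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        bin_acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(bin_acc - mean_conf)
    return ece


# Two hypothetical oracles giving the same answers (2 of 8 correct).
correct = [1, 0, 0, 0, 1, 0, 0, 0]
overconfident = [0.9] * 8   # always sounds sure
hedged = [0.25] * 8         # confidence matches its hit rate

acc = accuracy(correct)
ece_overconfident = expected_calibration_error(overconfident, correct)
ece_hedged = expected_calibration_error(hedged, correct)
```

Both hypothetical oracles have identical accuracy, yet one is badly miscalibrated and the other is not, which is why an accuracy headline alone cannot say whether an auditor can trust any individual answer.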

Self-application setting matters. All reported results use an oracle whose base model matches the target's base. The paper's related-work section notes that prior work has shown verbalization models can also interpret different target models, but with privileged-access advantages for self-application. Generalization across architectures is not the paper's focus, and the main results may overstate the practical applicability of AOs to interpreting closed-weight or unknown-architecture targets.

Benchmark scope. The four downstream tasks all use narrow fine-tunes (single secret, single attribute, single misalignment domain). Realistic post-trained models accumulate many simultaneous behavioral changes. The paper explicitly flags this as a future-work direction, not a confirmed limitation; whether AOs scale to realistically multi-dimensional fine-tuning is open.

Concepts

This finding is methodologically adjacent to introspection but not a within-pass instantiation. See Cross-references for the placement; the schema question is surfaced in the Summary.

Cross-references

Sources