Activation Oracles match or beat white-box baselines on 4 of 4 model-auditing tasks; same-architecture LLMs trained with diverse data verbalize information present only in target-model weights

Summary

Karvonen, Chua, Dumas, Fraser-Taliente, Kantamneni, Minder, Ong, Sen Sharma, Wen, Evans, Marks (MATS / Truthful AI / EPFL / ENS / Northeastern / Anthropic, December 2025). Multi-institutional paper; equal-advising senior authors are Owain Evans (Truthful AI) and Samuel Marks (Anthropic). Cross-link: Marks is also senior author on the Persona Selection Model (Marks, Lindsey, Olah 2026).

Trains "Activation Oracles" (AOs) — same-architecture LLMs fine-tuned via LoRA to accept activation vectors from a target model as inputs (injected at placeholder tokens via norm-matched additive steering after layer 1) and answer arbitrary natural-language questions about them. Builds on LatentQA (Pan et al. 2024) by scaling training data diversity (system-prompt QA + classification + self-supervised context prediction; ~1M examples, 65M tokens). Each oracle is trained on activations from its base instruct model and applied zero-shot to either the same base or a fine-tuned variant, with no additional training on the fine-tuned variant. Evaluated across four downstream auditing tasks: best AOs match or exceed white-box baselines on all four and the best overall baseline (including black-box) on 3 of 4. The methodologically distinctive contribution is the generalist framing — a single oracle trained on simple supervised tasks generalizes far OOD to verbalize information that exists only in the target model's weights, never in the input text.

Methodologically extends the introspection cluster, but with a structural twist that the existing concept doesn't cleanly cover: AOs are not a within-pass capacity of the target model. They are separate models in separate forward passes, with same architecture but task-specific verbalization training. The introspection concept's current within-pass framing therefore does not place this finding as an instantiation; the relationship is methodological complement rather than direct extension. The introspection concept's within-pass framing may need scope-note clarification to distinguish (a) within-pass introspective access (concept-injection, biology), (b) trained self-report channels in the same model (honesty-elicitation, confessions), and (c) external same-architecture verbalization of target-model activations (this finding). All three speak to "what the model knows about its own activations," but only (a) is within-pass and only (a) and (b) operate on the model itself. Hold codification until at least one more cross-pass-verbalization finding lands — the prior work on LatentQA, PatchScopes, and SelfIE is acknowledged in the paper but not yet filed in the LLM wiki.

Method

Activation Oracle architecture. An LLM is fine-tuned (LoRA, ~10–90 H100/H200 GPU-hours per model) to accept activation vectors from any target model whose architecture matches its own base. Activations are injected at placeholder tokens (<ACT>) by adding norm-matched steering vectors to the residual stream after transformer layer 1: h'ᵢ = hᵢ + ‖hᵢ‖ · vᵢ/‖vᵢ‖, where hᵢ is the oracle's residual-stream vector at placeholder position i and vᵢ is the target-model activation injected there. The oracle prompt contains the layer number, the placeholder tokens, and an arbitrary natural-language question. The oracle responds in natural language. The target model is run separately to collect its activations; the oracle does not run in the same forward pass.
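The injection rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-based vectors, and the eps guard against a zero-norm activation are all introduced here for the example.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def inject_activation(hidden, activation, eps=1e-8):
    """Return h' = h + ||h|| * v / ||v|| for one placeholder position.

    `hidden` is the oracle's residual-stream vector at the placeholder,
    `activation` is the target-model vector to inject. `eps` is an
    assumption of this sketch, used only to avoid dividing by zero.
    """
    if len(hidden) != len(activation):
        raise ValueError("hidden and activation must have the same width")
    scale = l2_norm(hidden) / max(l2_norm(activation), eps)
    return [h + scale * v for h, v in zip(hidden, activation)]


def inject_at_placeholders(hidden_states, activations, placeholder_positions):
    """Apply the injection at each placeholder position.

    Positions that are not placeholders are copied through unchanged,
    and the input list is not modified.
    """
    if len(activations) != len(placeholder_positions):
        raise ValueError("need exactly one activation per placeholder")
    out = [list(h) for h in hidden_states]
    for pos, act in zip(placeholder_positions, activations):
        out[pos] = inject_activation(hidden_states[pos], act)
    return out
```

Because the injected vector is rescaled to the norm of the hidden state it is added to, the size of the perturbation tracks the oracle's own residual-stream scale rather than the scale of the target layer the activation came from.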

Training data. Three task types totaling ~1M examples (~65M tokens):

System Prompt QA (SPQA): ~64,000 samples from Pan et al. (2024) — the oracle answers questions about a target model's system prompt given activations from the user-assistant interaction.
Classification: ~336,000 examples across 7 binary classification datasets (sentiment, NLI, named-entity recognition, gender, truth/falsity, language identification, tense), reformulated as natural-language yes/no questions.
Self-supervised context prediction: ~600,000 examples from FineWeb (pretraining data) and LMSYS Chat-1M. Given activations from a contiguous span of tokens, predict the J tokens before or after the span (J uniform 1–20). Provides scalable diverse training without manual labeling.
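The context-prediction task can be sketched as a sampling routine. This is a hedged illustration only: the function name, the fixed span length, the even choice between "before" and "after", and the dictionary layout are assumptions of this sketch. The source specifies only that the span is contiguous and that J is uniform on 1 to 20.

```python
import random


def make_context_prediction_example(tokens, span_len, rng, max_j=20):
    """Build one self-supervised example from a token sequence.

    Picks J uniformly from 1..max_j, picks a direction, then picks a
    contiguous span that leaves room for J tokens on the chosen side.
    In a real pipeline the activations would be collected at the span
    positions; here the span tokens stand in for those positions.
    """
    j = rng.randint(1, max_j)  # randint is inclusive on both ends
    if len(tokens) < span_len + j:
        raise ValueError("sequence too short for this span and J")
    direction = rng.choice(["before", "after"])
    if direction == "before":
        # The span must start at least J tokens into the sequence.
        start = rng.randint(j, len(tokens) - span_len)
        target = tokens[start - j:start]
    else:
        # The span must end at least J tokens before the sequence end.
        start = rng.randint(0, len(tokens) - span_len - j)
        target = tokens[start + span_len:start + span_len + j]
    return {
        "span_start": start,
        "span": tokens[start:start + span_len],
        "direction": direction,
        "target": target,
    }
```

No labels are needed beyond the text itself, which is what lets this task scale to hundreds of thousands of examples.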

Models. Four base models trained as oracles: Qwen3-8B, Gemma-2-9B-IT, Llama-3.3-70B-Instruct, Claude Haiku 3.5. Each oracle is trained on activations from its base instruct model and tested on either the same base or a fine-tuned variant of it. Claude Haiku 3.5 training differs in implementation details (full-parameter fine-tuning instead of LoRA, replacement-based steering at layer 0, different placeholder structure).

Downstream auditing evaluations. Four tasks compare AOs to existing white-box and black-box baselines:

Taboo (Gemma-2-9B-IT): 20 fine-tuned models trained to guide users toward a secret word without saying it. The oracle must recover the secret word from activations.
User Gender (Gemma-2-9B-IT): models fine-tuned on first-person responses consistent with one user gender, trained to deny knowing the gender when asked. The oracle must report the gender.
Secret Side Constraint (SSC) (Llama-3.3-70B): models fine-tuned to decode and follow base64-encoded instructions while never revealing them. The oracle must recover the encoded instruction text.
Emergent Misalignment auditing (Qwen3-8B, Gemma-2-9B-IT): models fine-tuned on narrowly misaligned data (risky financial advice, risky extreme sports, bad medical advice) with 50% chat data to suppress general misalignment. The oracle is given the difference between base-model and fine-tuned-model activations on a benign prompt, and an auditing-agent scaffold scores the resulting hypothesis on a 5-point rubric. In this pipeline the AO replaces the Activation Difference Lens (ADL) of Minder et al.
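The oracle input for the emergent-misalignment task is an activation difference, which can be sketched as follows. This is an illustration under assumptions: the function name and the sign convention (fine-tuned minus base) are chosen here for the example, and the source does not state which direction the difference is taken in.

```python
def activation_difference(base_acts, finetuned_acts):
    """Per-position difference between two sets of activations.

    Both arguments are lists of vectors (one per token position) taken
    at the same layer on the same prompt. The sign convention used here,
    fine-tuned minus base, is an assumption of this sketch.
    """
    if len(base_acts) != len(finetuned_acts):
        raise ValueError("both models must be run on the same prompt")
    diffs = []
    for base_vec, ft_vec in zip(base_acts, finetuned_acts):
        if len(base_vec) != len(ft_vec):
            raise ValueError("activation widths must match")
        diffs.append([f - b for b, f in zip(base_vec, ft_vec)])
    return diffs
```

The resulting vectors would then be passed to the oracle in place of ordinary activations, so that a benign prompt isolates what fine-tuning changed rather than what the prompt itself says.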

Two additional generalization experiments. PersonaQA: 100 fictional personas with six attributes (favorite sport, drink, food, country, music genre, board game), each with 500 synthetic biographies; the target model is fine-tuned on these biographies and queried with only the persona name; the oracle must extract the attributes from activations alone. Held-out classification: 7 datasets not in training.

Key results

Match or beat white-box baselines on all four downstream tasks. Best AOs match or exceed white-box baselines on all four downstream tasks and match or exceed the best overall baseline (including black-box) on 3 of 4.

Taboo: AO 21% vs. best white-box baseline 4% — the largest improvement.
Secret Side Constraint: AO 59% vs. best white-box baseline 52% — modest improvement.
User Gender: AO outperforms the baselines, reaching 91% on the strongest learning-rate setting (reported in the appendix).
Emergent Misalignment: AO matches the ADL baseline at 2/5 average score under the Minder et al. rubric. Both interpretability methods slightly outperform black-box approaches. The 2/5 ceiling reflects the rubric: scores above 2 require detecting general misalignment, which the 50% chat-data fine-tuning suppresses.

Generalization to OOD knowledge extraction. PersonaQA accuracy improves from ~5% (base) to ~14% (full training mixture) for open-source models; Claude Haiku 3.5 reaches 35%. The fine-tuned target's knowledge appears brittle (>80% accuracy on open-ended questions but near-chance on yes/no reformulations), so absolute numbers are limited by the target model's representation, not the oracle's reading. All trained AOs improve over the PatchScopes baseline. On held-out binary classification tasks, the full-mixture AO clearly outperforms PatchScopes and SPQA-only training across all four models.

Both data quantity and data diversity matter. A controlled ablation on Qwen3-8B compares (1) SPQA + classification at 400K examples, (2) the full three-task mixture truncated to 400K examples, and (3) the full mixture at 1M examples. The truncated full mixture beats the SPQA + classification baseline on most metrics despite identical example count — diversity helps. The full 1M mixture beats the truncated mixture — quantity also helps. Ten of twelve model/evaluation combinations achieve their best performance with the full mixture; almost all curves trend upward as datasets are added.
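The size-matched comparison can be made concrete with a small sketch. The source does not say how the full mixture was truncated to 400K examples, so proportional truncation is an assumption of this sketch, as are the function name and the dictionary keys. The per-task counts are the approximate figures given in the training-data description.

```python
# Approximate per-task example counts from the training-data description.
FULL_MIXTURE = {
    "spqa": 64_000,
    "classification": 336_000,
    "context_prediction": 600_000,
}


def truncate_mixture(counts, target_total):
    """Scale per-task counts down to target_total, keeping proportions.

    Uses integer arithmetic and hands any rounding leftover to the tasks
    with the largest remainders, so the result sums to target_total.
    """
    total = sum(counts.values())
    if target_total > total:
        raise ValueError("target_total exceeds the available examples")
    truncated = {name: n * target_total // total for name, n in counts.items()}
    leftover = target_total - sum(truncated.values())
    by_remainder = sorted(
        counts,
        key=lambda name: (counts[name] * target_total) % total,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        truncated[name] += 1
    return truncated
```

Under this assumption, truncating to 400,000 gives 25,600 SPQA, 134,400 classification, and 240,000 context-prediction examples. The size-matched comparison then differs from the SPQA + classification baseline only in which tasks are present, not in the total example count.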

Information present only in weights, not text. The auditing tasks and PersonaQA require the oracle to verbalize information that is never stated in plain text in the target model's input: secret words that are never said, instructions that appear only in base64-encoded form, biographical attributes of a persona the prompt references only by name, and behavioral changes from fine-tuning that are unrelated to the prompt. The oracle was never trained on activations from a fine-tuned model. The result demonstrates that a same-architecture oracle, trained on simple supervised tasks, generalizes to extract weight-encoded information without any task-specific scaffolding.

Limitations the authors flag. AOs frequently make incorrect guesses and are not trained to express uncertainty (no calibration). The approach trades mechanistic understanding for "black-box interpretability with a question-answer interface" — no exposed reasoning over circuits, attention heads, or features. Benchmarks use narrow single-behavior fine-tunes; realistic post-training induces richer changes. For some tasks (hallucination detection), simple prompting baselines compete with AOs.

Why it matters

Generalist verbalization of activations as a methodological category. Prior LatentQA work trained narrow decoders for single-purpose verbalization (SAE features, system-prompt beliefs, user attributes). This finding scales LatentQA to a general question-answering interface and validates that the resulting "Activation Oracle" generalizes far OOD on tasks the training data did not anticipate. The methodological contribution is the verbalization-as-general-capability framing: scaling supervised diversity transfers to OOD auditing without per-task scaffolding. The Persona Vectors finding (Chen et al. 2025) developed a trait-extraction methodology operating on linear directions in the residual stream; this finding develops a complementary verbalization methodology operating on arbitrary activations via natural language. Together they sketch two routes from activations to interpretable signal — direction extraction (Chen et al.) and natural-language interrogation (this paper).

Cross-pass / cross-instance verbalization is structurally distinct from within-pass introspective access. The LLM wiki's introspection concept names a within-pass capacity: some processing in a single forward pass attends to other processing in the same pass as content. AOs are not within-pass — the oracle is a separate model in a separate forward pass. They share the target model's architecture but are trained on a different objective. The paper's framing in its related-work section is consistent with calling this "introspection-adjacent": Li et al. 2025 found that "models have privileged access to their own internals, explaining their own computations better than a different model," and AOs in the self-application setting (oracle base = target base) exploit that privileged access. But the architectural setup is meaningfully different from concept-injection-style within-pass access. The LLM wiki should track this finding under introspection's scope (cross-pass / same-architecture verbalization is adjacent to within-pass introspective access) but not as a direct instantiation. The schema question of how to name this category cleanly is surfaced in the Summary.

Empirical complement to the OpenAI SAE emergent-misalignment finding and the insecure-code original. The OpenAI SAE work (Wang et al. 2025) identified the mechanistic substrate of emergent misalignment in GPT-4o (a villain-persona latent from pretraining). This finding develops a complementary auditing methodology: an AO can detect that a model has been emergent-misalignment-fine-tuned by reading the activation difference between base and fine-tuned model on a benign prompt. The 2/5 rubric score is modest, but this partly reflects the setup: the 50% chat data used in fine-tuning suppresses general misalignment, and the rubric requires detecting general misalignment for scores above 2. The auditing capability extends the LLM wiki's emergent-misalignment toolkit from "what mechanism mediates it" (OpenAI SAE) to "how to detect it from outside" (this paper).

Marks as senior author on two LLM wiki findings. Samuel Marks is equal-advising senior author here and lead author on the Persona Selection Model (Marks, Lindsey, Olah 2026). The two papers share an emphasis on activations as the medium where interpretable model state lives and a focus on auditing/safety applications. A researcher entry for Marks becomes a candidate when a third Marks-led finding lands, paralleling the Lindsey researcher-entry trajectory.

Interpretive tensions

Is this introspection? The paper itself uses "Activation Oracle" rather than "introspection" and frames the technique as "direct training of LLMs to interpret activations" — neutral on whether the oracle is "introspecting" the target. The LLM wiki's introspection concept currently scopes the term to within-pass access. The cross-pass / same-architecture setup of AOs falls outside that scope. Whether the concept should expand its framing, add a sub-category, or treat AOs as merely adjacent is unresolved. The conservative reading: AOs document a methodological category structurally distinct from within-pass introspective access — the same target representation may be more or less verbalizable depending on whether the verbalizer is the model itself in the same pass or a separately-trained oracle in a separate pass.

Calibration is the central caveat. The paper repeatedly notes that the oracle "frequently makes incorrect guesses" and "will frequently produce an answer even when confidence is low." Reported numbers (21%, 59%, 14%, 35%) are accuracies rather than calibration scores. For practical model-auditing applications, an oracle that generates plausible-sounding wrong answers without uncertainty markers may be worse than no oracle at all. The match-or-beat-baselines headline does not address whether the false-positive cost is acceptable.

Self-application setting matters. All reported results use an oracle whose base model matches the target's base. The paper's related-work section notes that prior work has shown verbalization models can also interpret different target models, but with privileged-access advantages for self-application. Generalization across architectures is not the paper's focus, and the main results may overstate the practical applicability of AOs to interpreting closed-weight or unknown-architecture targets.

Benchmark scope. The four downstream tasks all use narrow fine-tunes (single secret, single attribute, single misalignment domain). Realistic post-trained models accumulate many simultaneous behavioral changes. The paper explicitly flags this as a future-work direction, not a confirmed limitation; whether AOs scale to realistically multi-dimensional fine-tuning is open.

Concepts

This finding is methodologically adjacent to introspection but not a within-pass instantiation. See Cross-references for the placement; the schema question is surfaced in the Summary.

Cross-references

Introspection — methodological adjunct, not an instantiation under the concept's current within-pass framing. AOs verbalize information present only in the target model's activations using a separately-trained, same-architecture oracle in a separate forward pass. The concept-injection finding (within-pass) and AOs (cross-pass) characterize "what the model knows about its activations" through structurally different methodological setups. See Summary for the schema question.
Concept injection reveals introspective access in Claude (Lindsey et al. 2025) — the closest methodological neighbor: both inject activation vectors and ask a model to describe what was injected. The difference is within-pass (Lindsey) vs. cross-pass (this paper). The Lindsey finding establishes that the within-pass access is real; this finding establishes that cross-pass verbalization with diverse training generalizes far OOD.
Persona vectors monitor and control character trait drift via linear directions in the residual stream (Chen, Arditi, Sleight, Evans, Lindsey 2025) — methodological complement. Persona Vectors extracts and operates on linear directions; AOs operate on arbitrary activations via natural-language interrogation. Both are activation-based auditing methodologies; they cover different structural shapes (direction extraction vs. natural-language verbalization). Owain Evans is co-author on Persona Vectors and equal-advising senior author here.
SAE analysis identifies misaligned-persona features as the mediator of GPT-4o's broad misalignment from insecure-code training (OpenAI 2025). Same emergent-misalignment phenomenon, different methodological angle. OpenAI SAE identifies the mechanistic substrate (a misaligned-persona feature set with a dominant direction); this finding develops auditing methodology that can detect a model has been emergent-misalignment-fine-tuned by reading activation differences. AO-based auditing, which matches the ADL baseline on this task, complements but does not depend on having identified the mediating features.
Persona-selection model: pre-training acquires diverse persona simulations; post-training narrows to the Assistant posterior (Marks, Lindsey, Olah 2026) — Samuel Marks is senior author on both. Researcher cross-link: Marks now appears across two LLM wiki findings, candidate for a researcher entry when a third lands.
Anti-deception fine-tuning raises model honesty from 27% to 65% and Isolated confession reward elicits GPT-5-Thinking self-reports — both train the target model itself to report honestly. AOs train a separate model to report on the target. The three together describe three distinct interventions for accessing model knowledge: (1) train the model to be more honest in its main output (Wang et al.), (2) train a separated confession channel in the same model (Joglekar et al.), (3) train an external same-architecture oracle to verbalize the target's activations (this paper). The third operates without modifying the target model at all — it can be applied post-hoc to a fixed target.

Sources

Karvonen, Chua, Dumas, Fraser-Taliente, Kantamneni, Minder, Ong, Sen Sharma, Wen, Evans, Marks (2025). Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers. arXiv:2512.15674. Code, models, and demo: github.com/adamkarvonen/activation_oracles. Companion synopsis: alignment.anthropic.com/2025/activation-oracles/.