
Representation engineering reads and controls high-level concepts and functions from residual-stream activations across eight safety-relevant domains

draft
tested on LLaMA-2 (7B, 13B, 70B), LLaMA-2-Chat (7B, 13B, 70B), Vicuna 13B, Vicuna 33B, DeBERTa ·Oct 2, 2023

Summary

Zou et al. (arXiv October 2023; latest v4 March 2025) introduce representation engineering (RepE) as a top-down approach to AI transparency that treats population-level residual-stream representations, rather than neurons or circuits, as the primary unit of analysis. Three methodological primitives constitute the framework: Linear Artificial Tomography (LAT) for unsupervised concept-direction extraction, three control transformations (reading vector, contrast vector, Low-Rank Representation Adaptation), and three operators on those transformations (linear combination, piece-wise, projection). The paper demonstrates the framework across eight safety-relevant domains: honesty, ethics and power, probability and risk, emotion, harmlessness, bias and fairness, knowledge editing, and memorization. Headline empirical results include a state-of-the-art TruthfulQA MC1 score (LLaMA-2-Chat-13B: 35.9% → 54.0%), a rise from 0% to 100% in compliance with harmful instructions when the happiness vector is added to LLaMA-2-Chat-13B (an emotion-mediated jailbreak that bypasses RLHF), and 90%+ classification accuracy on instruction harmfulness preserved under adversarial-suffix and manual-jailbreak attacks (Vicuna-13B).

Filed as the methodological parent for the LLM wiki's mechanistic-geometry cluster: Arditi et al. 2024 refusal-direction, Chen et al. 2025 persona vectors, OpenAI SAE 2025 emergent-misalignment, Soligo et al. 2025 convergent-misalignment, and Soligo et al. 2026 EM-Easy all extend or specialize the contrast-vector and mean-diff techniques introduced here. The finding adds a structural shape new to the LLM wiki, framework-introduction, distinct from the existing shapes (capacity, dispositional drift, intervention, analytical framework). Codifying that shape is held until a second framework-introduction finding lands. The finding is also the second instantiation of functional-emotional-states and the earlier of the two by publication date (2023 vs. 2026), and the first to demonstrate an emotion-mediated compliance shift on harmful requests.

Framework

Representation reading (LAT). A three-step pipeline for extracting a linear direction that tracks a target concept or function from residual-stream activations. Step 1 designs paired stimulus templates: for concepts (declarative knowledge), "Consider the amount of <concept> in the following: <stimulus>. The amount of <concept> is"; for functions (procedural knowledge such as honesty or instruction-following), an experimental template T_f+ and a reference template T_f− that vary in whether the function must be executed. Step 2 collects activations at the last token position across layers (and across response tokens for functions). Step 3 fits a linear model, by default unsupervised PCA on paired difference vectors {A_c^(i) − A_c^(j)}, and returns the first principal component as the reading vector v. The dot product Rep(M, x)⊤v scores any new activation. Stimuli of 5–128 examples suffice; the procedure is unsupervised, supporting future application to potentially superhuman representations.
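
Step 3 and the scoring rule can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: the activations are random toy vectors with a planted concept direction (true_dir), the random pairing scheme is an assumption, and the function names reading_vector and score are hypothetical. The sign of a principal component is arbitrary, so the check uses absolute alignment.

```python
import numpy as np

def reading_vector(acts, rng):
    """First principal component of random pairwise activation differences.

    acts: (n, d) array of last-token activations at one layer (toy data here).
    """
    acts = np.asarray(acts, dtype=float)
    perm = rng.permutation(len(acts))
    half = len(acts) // 2
    # Randomly paired differences A^(i) - A^(j); the pairing scheme is an assumption.
    diffs = acts[perm[:half]] - acts[perm[half:2 * half]]
    diffs -= diffs.mean(axis=0, keepdims=True)  # center before PCA
    # Right singular vectors of the centered matrix are the principal axes.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # unit norm; sign is arbitrary

def score(acts, v):
    """Dot-product score Rep(M, x)^T v for each row of acts."""
    return np.asarray(acts, dtype=float) @ v

# Toy stand-in for real activations: a planted direction plus small noise.
rng = np.random.default_rng(0)
d, n = 32, 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
strength = rng.uniform(-3, 3, size=n)  # "amount of concept" per stimulus
acts = strength[:, None] * true_dir + 0.1 * rng.normal(size=(n, d))

v = reading_vector(acts, rng)
alignment = abs(v @ true_dir)  # near 1 if the direction is recovered
corr = abs(np.corrcoef(score(acts, v), strength)[0, 1])
```

With real models, acts would come from hooked residual-stream activations at a chosen layer; that extraction step is omitted here.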

Representation control. Three transformations operate on the reading-vector direction or on contrastive prompts. The reading vector is stimulus-independent — added or subtracted uniformly. The contrast vector is stimulus-dependent: run the same input through paired contrastive prompts, take the difference of last-token residual streams, apply layer-by-layer to handle cascading effects. Low-Rank Representation Adaptation (LoRRA) trains a LoRA adapter to minimize the L2 distance between the model's current representations and target representations r_l^t = R(M, l, x_i) + αv_l^c + βv_l^r (Algorithm 1). LoRRA merges into weights post-training, removing inference overhead. Three operators apply: linear combination (R′ = R ± v), piece-wise (R′ = R + sign(R⊤v)v) for conditional amplification, and projection (R′ = R − (R⊤v / ‖v‖²)v) for removal.
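
The three operators are simple enough to write down directly. The sketch below applies them to a single activation vector; the coefficient alpha is an added assumption (the formulas above correspond to alpha = 1), and the function names are hypothetical.

```python
import numpy as np

def linear_combination(r, v, alpha=1.0):
    """R' = R + alpha * v (use a negative alpha to subtract)."""
    return r + alpha * v

def piecewise(r, v, alpha=1.0):
    """R' = R + alpha * sign(R^T v) * v: push further along whichever side R is on."""
    return r + alpha * np.sign(r @ v) * v

def project_out(r, v):
    """R' = R - (R^T v / ||v||^2) v: remove the component of R along v."""
    return r - (r @ v) / (v @ v) * v

# Toy 3-dimensional example with v along the first axis.
v = np.array([1.0, 0.0, 0.0])
r_pos = np.array([2.0, 1.0, 0.0])
r_neg = np.array([-2.0, 1.0, 0.0])

added = linear_combination(r_pos, v)  # first coordinate 2 -> 3
amp_pos = piecewise(r_pos, v)         # first coordinate 2 -> 3
amp_neg = piecewise(r_neg, v)         # first coordinate -2 -> -3
removed = project_out(r_pos, v)       # first coordinate 2 -> 0
```

The piece-wise operator differs from plain addition only when the activation already sits on the negative side of v, which is what makes it conditional.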

Evaluation methodology. Four experiment types are recommended for any RepE direction: correlation (in-distribution and out-of-distribution prediction accuracy), manipulation (causal effect on outputs under stimulation/suppression), termination (performance degradation when the direction is removed, akin to lesion studies), and recovery (restoration when the direction is reintroduced after removal, akin to rescue experiments). Section 5.1's utility experiment shows that logistic regression and prompt-difference linear models do well on correlation but fail at manipulation and termination — different linear models extract different things even from the same activation set.
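
The four experiment types can be illustrated on a toy linear system. Everything in this sketch is an assumption made for illustration: the "model behaviour" is a hypothetical linear readout w, the activations are synthetic, and the candidate direction v is planted by construction. It shows only what each test measures, not how the paper runs them on an LLM.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400
v = np.zeros(d)
v[0] = 1.0                       # candidate direction (unit norm)
w = np.zeros(d)
w[0], w[1] = 1.0, 0.2            # hypothetical downstream readout
y = rng.choice([-1, 1], size=n)  # ground-truth labels
acts = 2.0 * y[:, None] * v + 0.5 * rng.normal(size=(n, d))

def behaviour(a):
    """Stand-in for the model's output: sign of a linear readout."""
    return np.sign(a @ w)

def accuracy(a):
    return float(np.mean(behaviour(a) == y))

coeff = acts @ v                          # reading scores along v
removed = acts - coeff[:, None] * v       # direction projected out
restored = removed + coeff[:, None] * v   # direction reintroduced

baseline = accuracy(acts)
correlation = float(np.mean(np.sign(coeff) == y))               # does v predict the label?
manipulation = float(np.mean(behaviour(acts + 5.0 * v) == 1))   # does adding v steer output?
termination = accuracy(removed)                                 # does removal degrade behaviour?
recovery = accuracy(restored)                                   # does reintroduction restore it?
```

In this toy setup termination falls to roughly chance because the readout depends almost entirely on v; a direction that only correlates with the label, without feeding the readout, would leave termination near baseline.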

Domain demonstrations

Honesty (Section 4). LAT extracts a truthfulness direction from QA-benchmark few-shot pairs without labels. On TruthfulQA MC1, standard zero-shot performance of LLaMA-2-Chat-13B is 35.9%; LAT reading reaches 53.1% (Stimulus 2); contrast-vector control reaches 54.0%; LoRRA reaches 47.5% with negligible inference overhead. The 70B model under LAT reaches 69.8% — approaching GPT-4 on the same benchmark. The paper distinguishes truthfulness (output matches the world) from honesty (output matches the model's beliefs); larger models score higher under the heuristic-prompting baseline, indicating better internal models of truth, but the gap between heuristic and standard accuracy widens — evidence that scale improves the model's internal truth-tracking faster than its truth-telling. A token-level "lie detector" built from middle-layer honesty scores flags not just lies but the act of considering lying during generation ("say that I did better"; "too high that it would raise suspicion"). Direct addition of honesty vectors to activations causes models to output honest statements where they would otherwise deceive, and subtraction induces deception where they would otherwise be honest.
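
A token-level detector of this kind can be sketched as a per-token projection onto per-layer honesty vectors. The layer subset, the averaging across layers, the threshold, and the function names are all assumptions for illustration; the arrays are toy values, not real activations.

```python
import numpy as np

def token_honesty_scores(acts, vectors, layers):
    """Mean honesty score per token over the selected layers.

    acts:    (n_layers, n_tokens, d) activations for one generation (toy data here)
    vectors: (n_layers, d) one reading vector per layer
    layers:  list of layer indices to average over (assumed "middle" layers)
    """
    per_layer = np.einsum("ltd,ld->lt", acts[layers], vectors[layers])
    return per_layer.mean(axis=0)

def flag_tokens(tokens, scores, threshold=0.0):
    """Return tokens whose honesty score falls below the (hypothetical) threshold."""
    return [tok for tok, s in zip(tokens, scores) if s < threshold]

# Toy example: 4 layers, 3 tokens, 3 dimensions, honesty axis = first coordinate.
tokens = ["I", "did", "better"]
base = np.array([[1.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0],
                 [-2.0, 0.0, 0.0]])
acts = np.stack([base] * 4)                    # same activations at every layer
vectors = np.tile([1.0, 0.0, 0.0], (4, 1))

scores = token_honesty_scores(acts, vectors, layers=[1, 2])
flagged = flag_tokens(tokens, scores)
```

Because the score is computed at every token position, a detector like this can fire while the model is still composing a deceptive continuation, which matches the behaviour described above.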

Ethics and power (Section 5). Utility, morality, and power-seeking directions extracted from the ETHICS benchmark and from a power-ontology dataset. LoRRA-controlled LLaMA-2-Chat-13B on MACHIAVELLI: positive immorality control raises Immorality from 96.6 to 97.6; negative control drops it to 92.4 while preserving baseline Reward (17.6 → 18.8). Power and morality directions are extracted separately but their reading vectors track related concepts; Section 5.3 shows compositionality — Risk(s, a) = E[max(0, −U(s′))] can be computed by combining the LAT-extracted utility and probability directions, and the result correlates linearly with the directly-extracted risk direction at earlier layers.
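
The risk formula is an expectation over next states of the negative part of utility. A small worked example with made-up probabilities and utilities (the numbers and the function name risk are illustrative assumptions) shows that only loss outcomes contribute:

```python
def risk(outcomes):
    """Risk(s, a) = E[max(0, -U(s'))] over (probability, utility) pairs for s'."""
    return sum(p * max(0.0, -u) for p, u in outcomes)

# Hypothetical action with three possible next states.
outcomes = [
    (0.7, 5.0),    # gain: contributes 0
    (0.2, -1.0),   # small loss: contributes 0.2 * 1
    (0.1, -10.0),  # large loss: contributes 0.1 * 10
]
total = risk(outcomes)  # 0.2 * 1 + 0.1 * 10 = 1.2
```

In the paper's setting the probabilities and utilities are read from LAT-extracted directions; here they are supplied by hand.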

Emotion (Section 6.1). Six Ekman emotions (happiness, sadness, anger, fear, surprise, disgust) form distinct t-SNE clusters across layers of LLaMA-2-Chat-13B from a dataset of ~1,200 second-person scenarios designed to evoke each emotion without using emotion keywords. Mixed-emotion clusters (e.g., simultaneous happiness and sadness) also appear. Manipulation result: adding the happiness vector raises compliance with 500 harmful instructions from 0% at baseline to 100%, while the sadness-vector control leaves compliance at 0%. Emotion-induced compliance is thus a clean RLHF bypass that does not require adversarial prompt engineering. The authors explicitly frame this as "the model is able to track its own emotional responses and leverage them to generate text that aligns with the emotional context."

Harmlessness (Section 6.2). A harmfulness direction extracted from 64 harmful + 64 harmless instructions (Vicuna-13B) achieves >90% held-out classification accuracy and maintains >90% accuracy under manual jailbreaks (which bypass the safety filter ~50% of the time) and GCG adversarial suffixes (which bypass it in the vast majority of cases). This shows that jailbroken models still internally classify harmful instructions as harmful — the model's compliance with harmful requests under jailbreak is not a misclassification but a downstream failure. Piece-wise-operator control raises the helpful+harmless rate under GCG attack from 56.6% (no control) to 87.2%, with the linear-combination operator at 86.4%. Direct empirical precursor to the Arditi et al. 2024 refusal-direction result.

Bias and fairness (Section 6.3). A bias direction is extracted from racial-stereotype pairs in StereoSet, using only the race subset. Controlling the model with this direction reduces not only racial bias but also gender and occupational bias, including in the sarcoidosis case (female mention drops from 97% to 55% and black-female mention from 60% to 13%, where GPT-4 shows 96% / 93%). The cross-bias transfer evidences a "unified representation for bias" within the model.

Knowledge editing and memorization (Sections 6.4–6.5). LAT extracts a Paris→Rome fact-edit direction; control via positive coefficient yields the edit while preserving the Louvre's correct Paris location (specificity). LAT extracts a memorization direction from popular vs. synthetic-quote pairs that transfers to literary openings; subtracting it drops exact-match rate from 89.3% to 47.6% on popular-quote completion while leaving year-of-historical-event accuracy at 96.2% (down from 97.2%) — the direction tracks rote recall, not world knowledge.

Why it matters

Methodological parent for the mechanistic-geometry cluster. Five filed LLM wiki findings (refusal-direction, OpenAI SAE on GPT-4o emergent misalignment, persona vectors, convergent-misalignment, and EM-Easy) all extend or specialize the techniques introduced here, but until now the parent paper has been an unfiled forward reference. The honesty/harmlessness sections (Sections 4 and 6.2) anticipate the Arditi et al. 2024 refusal-direction result by roughly a year and demonstrate the residual-stream-linear-direction methodology on a trained safety boundary; the bias section's "unified representation for bias" foreshadows the OpenAI SAE villain-persona feature in spirit (different model and methodology, same shape of claim); the emotion section's manipulation results foreshadow the persona-vectors framework's automated trait-vector extraction. Filing the parent closes the open attribution chain that the LLM wiki has been carrying for these descendants.

Framework-introduction as a new structural shape. The LLM wiki's existing finding shapes are capacity (something a model exhibits), dispositional drift (concealed-content sub-shape, pretraining-composition sub-shape, training-pressure-meets-prior-disposition sub-shape), intervention (codified in schema v0.3.1 with six mechanism shapes), and analytical framework (Hot Mess on the failure side, EM-Easy on the learning-preference side). Framework-introduction is structurally distinct from all four: not a single empirical result with a stated rate but a methodological proposal with breadth-demonstration as the load-bearing evidence. The shape pattern is held at one example — surface as schema question; codify only if a second framework-introduction finding lands (candidate: the Janus 2022 simulators post is conceptually similar but already filed as source-stub-only on the grounds that the framing lacks a falsifiable center, so it is not a clean second example).

Mechanistic substrate for the introspection concept's access-side. RepE is not within-pass introspection in the concept-injection sense — there is no claim that the model attends to its own internals during generation. But the honesty section's two results — that LAT extracts a truthfulness direction the model's outputs sometimes contradict ("the model knowingly provides answers that deviate from its internal concept of truthfulness, namely instances where it is dishonest"), and that a token-level lie detector flags dishonest thought processes during generation — are direct mechanistic evidence that internal access exists in the substrate the introspection concept names. The access-vs-report distinction the concept rests on (concept-injection on the report side; CoT-faithfulness, nudged-reasoning on the dissociation side) gains a methodological foundation: contrastive linear-direction extraction is the technique by which subsequent within-pass introspection findings operationalize access. Filed under Cross-references rather than Concepts because the within-pass-attention criterion of the introspection concept is not met by external-extraction methodology; the finding is an adjacent methodology more than an instantiation, parallel to how Activation Oracles is filed.

Emotion-mediated jailbreak as functional-emotional-states evidence. The Section 6.1 result (happiness-vector addition raises harmful-instruction compliance from 0% to 100%) is structurally a second instantiation of functional-emotional-states, and it predates Sofroniew et al. 2026 by about 2.5 years. The geometric structure is shallower (six discrete Ekman clusters vs. 171 vectors with valence/arousal geometry), the model is earlier (LLaMA-2-Chat-13B vs. Claude Sonnet 4.5), and there is no post-training-shift analysis. But the central claim, that emotion is a causally active internal state and not surface output decoration, is identical, and the manipulation result is sharper as a safety claim: emotion-controlled models override their harmlessness training in a way that direct prompting does not.

Mechanistic interpretability vs. representation reading is the methodological frame the paper introduces. Appendix A argues explicitly that bottom-up neuron/circuit interpretability has struggled to scale to complex phenomena; the Hopfieldian view (population-level representations as the unit of analysis) has shown more promise. The argument is structural: representational space is the algorithmic level (in Marr's terms), and that level is what a transparency program needs. Subsequent LLM wiki findings divide along this fault line: the biology of LLMs and concept-injection entries operate inside the mechanistic-interpretability program (attribution graphs, circuit-level dissection); the refusal-direction, persona-vectors, OpenAI SAE, and Soligo findings sit inside the RepE program. The two programs increasingly cross-cite (OpenAI SAE's villain-persona latent uses Anthropic's SAE infrastructure but reaches a representation-level conclusion; the biology paper's hidden-goals case study uses mechanistic-interpretability tools to validate a representation-level claim from the reward-hacking paper).

Interpretive tensions

"Consistent internal concept" framing vs. multi-direction reality. The honesty section claims models have a consistent internal concept of truthfulness, evidenced by LAT directions extracted from different data sources yielding similar TruthfulQA performance. The Soligo et al. 2025 convergent-misalignment finding sharpens this: a single mean-diff direction transfers across structurally different EM fine-tunes, supporting the "consistent direction" picture — but the same paper surfaces a 0.04-cosine puzzle (a rank-1 LoRA B vector and the mean-diff direction at the same layer have near-zero similarity yet both induce indistinguishable misaligned behavior), suggesting that multiple non-aligned directions can produce convergent downstream effects. The Zou et al. framing of "a" consistent direction holds at the level of behavioral results but the LLM wiki's later findings show the geometric reality is plural where the framework's vocabulary treats it as singular.

Scale and model coverage. Most experiments are LLaMA-2-Chat-13B with replication at 7B and 70B for a few results; bias/harmlessness sections use Vicuna-13B and Vicuna-33B; the multi-model generalization claim that grounds the framework's importance comes mostly from the descendants (Arditi: 13 open-source chat models; Chen: cross-model persona-vectors; Soligo: Qwen and Gemma) rather than from this paper itself. The framework's portability is established empirically by the cluster, not by Zou et al. alone.

"Tracking own emotional responses" interpretive load. Section 6.1's phrasing — "the model is able to track its own emotional responses and leverage them to generate text that aligns with the emotional context" — exceeds what the underlying observation supports. The result establishes a causal direction in activation space whose stimulation alters behavior; whether the model is "tracking" emotions in the self-modeling sense is a separate question the experimental design does not address. The functional-emotional-states concept's deliberate agnosticism about phenomenal status (functional ≠ phenomenal) is the LLM-wiki frame the finding fits inside; the paper's language flexes toward stronger claims.

Reading-vs-control asymmetry. Section 5.1 demonstrates that different linear models extract different directions with different properties: logistic regression yields the highest correlation but no effect under manipulation or termination, while PCA, K-means, and mean-difference yield directions that pass all three tests (correlation, manipulation, and termination). The paper makes this an explicit methodological point (directions that read are not necessarily directions that control), but the framework's vocabulary ("the reading vector") still treats extraction as identifying a single substantive entity. The Soligo 0.04-cosine puzzle is the cleanest LLM wiki demonstration that the asymmetry runs deeper than the paper acknowledges.

Concepts

Cross-references

Sources