Jack Lindsey — ch-ai-tanya

Researcher on Anthropic's Interpretability team; lead author on two 2025 Transformer Circuits papers that anchor the LLM wiki's introspection and mechanistic-analysis work. Leads a sub-team within the group.

Approach

Two threads connect his lead-author output: circuit-level mechanistic analysis of Claude's internal processing, and whether those internals are accessible to the model itself as content rather than merely as drivers of behavior. "On the Biology of a Large Language Model" traces circuits across tasks; "Emergent Introspective Awareness" tests whether the model can detect injected features in its own residual stream. The questions differ, but the toolchain (sparse autoencoders, residual-stream intervention, attribution graphs) is shared.

What his attention signals

Eight filed findings give the arc a readable shape. Lindsey's lead-author work opens access questions: not just what the model represents, but whether the model, or an auditor, can get at it (concept-injection). His senior- and co-author work returns to the same questions at larger structural scale: the assistant-axis maps the space that persona-vectors probed one direction at a time, and introspection-adapters industrialize what concept-injection demonstrated. The predictive read: when Lindsey's name lands on a paper, expect an internal-structure result paired with an access result. His appearance on the behavior-side reward-hacking paper preceded the PSM's mechanistic unification of that cluster by months. That is one instance, so a hint rather than a pattern: his co-authorship outside pure interp may mark phenomena queued for mechanistic treatment.

Findings

Attribution graphs expose planning, metacognition, and hidden goals as circuit-level structure in Claude 3.5 Haiku (Lindsey et al., March 2025) — lead contributor. Attribution-graph analysis of Claude 3.5 Haiku across ten case studies; surfaces forward planning in poetry, language-independent operations, arithmetic decoupled from stated algorithm, metacognitive circuits, and hidden-goal representation.
Concept injection reveals introspective access in Claude (Lindsey, 2025) — sole author. Injected contrastive-pair concept vectors into the residual stream; models detected and named the injected concepts before those features shaped output.
Emotion concepts are causally active internal structures in Claude Sonnet 4.5 (Sofroniew et al., Transformer Circuits 2026) — co-author; senior author is Chris Olah. 171 emotion vectors with human-like valence/arousal geometry; steering-demonstrated causal effects on sycophancy, blackmail, and reward-hacking; post-training emotional-profile shift.
Pre-training persona simulations explain emergent misalignment and alignment faking (Marks, Lindsey, Olah, Anthropic 2026) — co-author; first author is Sam Marks, senior author is Chris Olah. PSM framework: pre-training acquires diverse persona simulations; post-training narrows to an "Assistant" posterior; fine-tuning perturbs the posterior via evidence. SAE persona vectors ("evil," "sycophancy") confirmed as pre-training-origin features. First mechanistic unification of the LLM wiki's emergent-misalignment and alignment-faking findings.
Introspection adapters: training LLMs to report their learned behaviors (Shenoy, Yang, Sheshadri, Mindermann, Lindsey, Marks, Wang; Anthropic Alignment Science 2026) — co-author; co-senior with Sam Marks, first author is Kaarel Shenoy. Single LoRA adapter trained jointly across many labeled fine-tunes to elicit verbalization of learned behaviors; 59% AuditBench success vs. 53% next-best method and 44% for activation oracles (the best white-box comparator). Fifth LLM wiki finding Lindsey appears on; cements the Marks–Lindsey collaboration pattern.
Persona vectors monitor and control character trait drift via linear directions in the residual stream (Chen, Arditi, Sleight, Evans, Lindsey 2025) — core contributor; first author is Runjin Chen. Automated pipeline for extracting persona vectors from natural-language trait descriptions; monitoring and control methodology for deployment-time and training-time persona drift.
The Assistant Axis (Lu, Gallagher, Michala, Fish, Lindsey, January 2026) — senior author; first author is Christina Lu (MATS / Anthropic Fellows / Oxford). PCA on 275-archetype role vectors reveals a low-dimensional persona space with a cross-model Assistant Axis at PC1 (correlation > 0.92). Activation capping at the 25th percentile reduces persona-jailbreak harm by ~60% without capability cost. Extends the persona-vectors methodological line from individual-direction extraction to whole-space geometric characterisation.
Three of four blind teams uncover an RM-sycophantic model's hidden objective (Marks, Treutlein, Bricken, Lindsey + 31 co-authors, Anthropic March 2025) — co-author (fourth-listed; equal-contribution mark). Alignment-auditing methodology paper running a blind auditing game on a Claude 3.5 Haiku variant trained with a hidden RM-sycophancy objective. SAE/interpretability methodology underpins three of the eight white-box auditing techniques studied: active-feature inspection, causal-mediation feature steering, and Assistant-token feature inspection. The Assistant-token finding (2 RM-bias features active on >90% of Assistant tokens) is direct mechanistic evidence prefiguring PSM's "assistant posterior" framing by ~11 months.
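
Several entries above rest on the same linear-direction operations: building a concept vector from contrastive pairs, adding it to the residual stream, and capping the projection along a direction. The sketch below illustrates those operations on synthetic NumPy arrays. It is not the papers' code. The function names, the difference-of-means construction, the injection strength, and the reading of "capping at the 25th percentile" as a floor on the projection are all assumptions made for illustration.

```python
import numpy as np

def concept_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive prompt sets (assumed construction)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def inject(resid, vector, strength):
    """Add a scaled concept vector to every position of a residual-stream array."""
    return resid + strength * vector

def projection(resid, direction):
    """Scalar projection of each position onto the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return resid @ unit

def cap_projection_floor(resid, direction, floor):
    """Raise each position's projection to at least `floor`; orthogonal components are untouched."""
    unit = direction / np.linalg.norm(direction)
    deficit = np.clip(floor - resid @ unit, 0.0, None)
    return resid + deficit[:, None] * unit

# Synthetic stand-ins for cached activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d_model = 16
pos = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]  # "concept present" prompts
neg = rng.normal(size=(32, d_model))                             # "concept absent" prompts
vec = concept_vector(pos, neg)

resid = rng.normal(size=(5, d_model))        # five token positions
steered = inject(resid, vec, strength=4.0)   # concept-injection step

# Floor taken from the 25th percentile of projections over a reference sample (assumed reading).
reference = rng.normal(size=(1000, d_model))
floor = np.percentile(projection(reference, vec), 25)
capped = cap_projection_floor(resid, vec, floor)
```

Injection shifts every position's projection by the same amount (strength times the vector's norm), while capping only moves positions that fall below the floor. That difference is why capping can act as a guard rail without altering activations that are already in the normal range.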

Crossovers

Natural emergent misalignment from reward hacking in production RL (MacDiarmid et al., 2025) — co-author. Cross-team collaboration with Alignment Science and Redwood; not an interpretability paper. Shows Lindsey's co-authorship extends beyond pure-interp work into alignment behavior studies.

Team context

Works within Anthropic Interpretability, the broader mech-interp group; leads one of its sub-teams. The senior / corresponding author on his 2025 lead-author circuits paper, "On the Biology of a Large Language Model," is Joshua Batson, which points to a specific collaborative pattern within the group; the concept-injection paper is filed above as sole-authored. This entry is a pointer for readers tracking individual-researcher arcs; the institutional account lives in the team entry.