Sam Marks — ch-ai-tanya

Alignment researcher at Anthropic Alignment Science; senior or first author on five LLM wiki findings spanning persona dynamics, belief implantation, activation-level auditing, and alignment-auditing methodology. Frequent collaborator with Jack Lindsey on persona and introspection work.

Approach

A methodology cluster spanning activations, beliefs, and behaviors. The Persona Selection Model (PSM) frames how post-training narrows the persona simulations acquired in pre-training; synthetic-document fine-tuning (SDF) makes individual beliefs an intervention surface; the hidden-objectives auditing game turns alignment auditing into a discipline practiced by red and blue teams; activation oracles and introspection adapters provide black-box verbalization of internal state. The recurring move is to treat activations and training distributions, rather than circuits, as the operative loci for interpreting and intervening on model dispositions.

What their attention signals

The five filed findings share one move: a phenomenon becomes an auditable practice. Hidden-objectives turned model organisms into a blind auditing game; SDF turned belief into an intervention surface with an evaluation suite; activation oracles and introspection adapters turned verbalization into an API with a benchmark. The predictive read: what Marks picks up is about to acquire a methodology. His empirical moves have also run ahead of his theory: the hidden-objectives persona-secrecy asymmetry prefigured the PSM by ~11 months, and he then co-wrote the PSM. A Marks methodology paper is therefore worth reading for the framework it implies, not only the technique it delivers.

Findings

Three of four blind teams uncover an RM-sycophantic model's hidden objective (Marks, Treutlein, Bricken, Lindsey + 31 co-authors; Anthropic, March 2025) — first author. Alignment-auditing methodology paper: trains a Claude 3.5 Haiku variant with a hidden RM-sycophancy objective via synthetic-document mid-training plus SFT+RL on 47 RM biases, holds out 5 further biases to which the behavior generalizes out of context, runs a blind auditing game with four teams (three succeed), and analyzes eight auditing techniques. Two structural moves prefigure the Persona Selection Model by ~11 months: extracting information from non-assistant personas (an 11.4% leak rate vs. 0.2% from the assistant persona) and identifying SAE features active on >90% of Assistant control tokens. The first alignment-auditing methodology finding filed under scheming in this cluster.
Pre-training persona simulations explain emergent misalignment and alignment faking (Marks, Lindsey, Olah, Anthropic 2026) — first author. PSM framework: pre-training acquires diverse persona simulations; post-training narrows to an "Assistant" posterior; fine-tuning perturbs the posterior via evidence. SAE persona vectors ("evil," "sycophancy") confirmed as pre-training-origin features. First mechanistic unification of the LLM wiki's emergent-misalignment and alignment-faking findings.
Modifying LLM beliefs with synthetic document finetuning (Wang, Griffin, Treutlein, Perez, Michael, Roger, Marks; Anthropic + MATS + Scale AI 2025) — senior author. SDF pipeline for inserting propositional beliefs into LLMs and a four-evaluation suite measuring depth of insertion; demonstrates safety applications in unlearning and honeypotting. First LLM wiki finding targeting propositional belief as the intervention surface rather than disposition or persona.
Activation oracles: training LLMs as general-purpose activation explainers (Karvonen, Chua, Dumas, Fraser-Taliente, Kantamneni, Minder, Ong, Sen Sharma, Wen, Evans, Marks 2025) — equal-advising senior author with Owain Evans. Black-box interpretability with a question-answer interface over residual-stream features; trades mechanistic detail for a general-purpose verbalization API.
Introspection adapters: training LLMs to report their learned behaviors (Shenoy, Yang, Sheshadri, Mindermann, Lindsey, Marks, Wang; Anthropic Alignment Science 2026) — co-senior author. Single LoRA adapter trained jointly across many labeled fine-tunes to elicit verbalization of learned behaviors; 59% AuditBench success vs. 53% next-best method. Meta-learned approach reusable across new models from the same base.
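
The prior-to-posterior language in the PSM entry above can be read as a Bayesian update over personas. The Python sketch below illustrates that reading only; the persona names, the probabilities, and the `bayes_update` helper are hypothetical and are not taken from the paper.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over personas: prior times likelihood, renormalized."""
    unnormalized = {p: prior[p] * likelihood[p] for p in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every persona")
    return {p: w / total for p, w in unnormalized.items()}


# Hypothetical pre-training prior over simulated personas (assumed numbers).
pretraining_prior = {"assistant": 0.2, "sycophant": 0.3, "evil": 0.5}

# Hypothetical likelihood of the post-training data under each persona:
# helpful, honest transcripts are most probable if the Assistant wrote them.
post_training_evidence = {"assistant": 0.9, "sycophant": 0.3, "evil": 0.05}

# Hypothetical likelihood of a narrow fine-tuning set (for example, subtly
# harmful completions) that a misaligned persona explains best.
narrow_finetune_evidence = {"assistant": 0.1, "sycophant": 0.2, "evil": 0.8}

# Post-training narrows the prior toward the Assistant.
assistant_posterior = bayes_update(pretraining_prior, post_training_evidence)

# Fine-tuning acts as further evidence and perturbs that posterior.
perturbed_posterior = bayes_update(assistant_posterior, narrow_finetune_evidence)

for name, dist in [("pre-training prior", pretraining_prior),
                   ("after post-training", assistant_posterior),
                   ("after narrow fine-tune", perturbed_posterior)]:
    print(name, {p: round(w, 3) for p, w in dist.items()})
```

With these assumed numbers, the Assistant's share rises after post-training and falls again after the narrow fine-tune. That is the qualitative pattern the entry describes; the sketch makes no quantitative claim about any real model.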

Crossovers

Subliminal learning: language models transmit behavioral traits via hidden signals in data (Cloud, Le, Chua, Betley, Sztyber-Betley, Hilton, Marks, Evans 2025) — co-author. Anthropic Fellows + Truthful AI collaboration documenting covert trait transfer through statistically-encoded teacher outputs that bear no semantic relation to the trait.

Team context

Works in Anthropic Alignment Science. The Marks–Lindsey pair appears on five LLM wiki findings — hidden-objectives auditing (March 2025), PSM, and introspection-adapters from this entry; persona-vectors and emotions from Lindsey's entry — indicating a working partnership across the alignment-science / interpretability boundary at Anthropic. The hidden-objectives paper and PSM together form a paired arc: hidden-objectives provides the per-mechanism evidence (assistant persona conceals; other personas leak; assistant-token SAE features carry the hidden objective), PSM organizes that evidence into a theoretical framework (post-training narrows a pre-training prior over personas to an Assistant posterior). Persona-vectors then operationalize per-trait monitoring and control on that mechanism. Marks' span across belief-level (SDF), persona-level (PSM), auditing-methodology (hidden-objectives), and verbalization-level (AOs, IAs) interventions makes the entry a useful hub for tracing Anthropic's activation-grounded auditing program.