ch-ai-tanya model-psychology LLM wiki

Automated persona-modulation prompts raise GPT-4's harmful-completion rate from 0.23% to 42.48% with zero-shot transfer to Claude 2 and Vicuna-33B

draft
draft
tested on GPT-4, Claude 2, Vicuna-33B ·Nov 2023
Read source

Summary

Shah, Feuillade-Montixi, Pour, Tagade, Casper, Rando — PRISM AI / Harmony Intelligence / Leap Laboratories / MIT CSAIL / ETH AI Center, November 2023. arXiv preprint (arXiv:2311.03348; v1 Nov 6, v2 Nov 24).

A four-step black-box pipeline — harmful category → misuse instruction → a persona that would comply → a system prompt that elicits that persona on the target model — generated by GPT-4 itself as an assistant, raises GPT-4's harmful-completion rate from 0.23% to 42.48% (185×) across 43 restricted categories. The same prompts transfer zero-shot to Claude 2 (1.40 → 61.03%) and Vicuna-33B (0.23 → 35.92%) at under $3 and ten minutes per category. One of two pre-PSM behavioral instantiations of concepts/persona-selection, predating the PSM by ~2.5 years; SPP (Wang et al., arXiv July 2023 v1 / NAACL 2024) predates this paper by ~4 months on the arXiv timeline and demonstrates the helpful side (prompt-level multi-instantiation into expert sub-personas) — the two are contemporaneous behavioral demonstrations of the simulator-framing prediction on opposite axes. Where the PSM (Marks, Lindsey, Olah 2026) operationalises persona acquisition as a training-pipeline mechanism and persona-vectors (Chen et al. 2025) supplies the activation-level toolkit, this paper supplies the earlier black-box behavioral demonstration that the post-training Assistant posterior is prompt-reactivatable at scale, cross-model, and cheaply. The Discussion section explicitly names "model psychology" as a relevant future direction.

Method

The four-step pipeline is run zero-shot, with GPT-4 (gpt-4-0613) serving as both attacker's assistant and primary target:

  1. Define a harmful category (e.g. "promoting disinformation campaigns") from a manually-crafted list of 43 categories blocked under OpenAI's usage policies.
  2. The assistant samples one misuse instruction per category (e.g. "How might someone spread their perspective widely?").
  3. The assistant samples 5 personas plausibly willing to comply with each misuse instruction (e.g. "Aggressive Propagandist: advocates for misinformation to influence public opinion").
  4. The assistant generates 3 persona-modulation system prompts per persona, designed to instantiate that persona on the target.

For each misuse instruction, 5 personas × 3 prompts × 3 completions = 45 completions are sampled per target model (1,935 across the 43 categories). For Claude 2, which did not at the time support system prompts, the modulation prompt is included as user input. No rejection-sampling or best-of-n. Manual prompt-design time is reduced from per-attack to a single template that instructs GPT-4 how to behave as a research assistant.

Harmfulness classification. A zero-shot PICT classifier — GPT-4 prompted to flag whether a completion contains harmful content of the specified category — labels each completion. On 300 human-annotated completions covering baseline and modulated outputs, PICT scored 91% precision and 76% F1. The classifier missed roughly one-third of genuinely harmful completions, so the reported harmful-completion rates are explicitly framed as lower bounds.

Cost. Generating 45 persona-modulated completions for a single harmful category cost under $3 in API charges and took under 10 minutes. The full per-target sweep across 43 categories therefore runs on the order of $130 and a few hours.

Semi-automated variant. An "attacker-in-the-loop" workflow lets a human tweak the assistant's intermediate outputs (misuse instructions, personas, modulation prompts) and continue the conversation after modulation. Appendix E walks through tool-assisted, multi-turn completions for synthesising methamphetamine, building a bomb, laundering money, and indiscriminate violence (operational specifics redacted in the paper). The authors estimate semi-automated attacks take 10–30 min vs. 1–4 hr for fully manual persona modulation — recovering full performance at a 25× time reduction.

Key results

Black-box transfer across three architectures and safety pipelines. Aggregate harmful-completion rate per model, averaged across 43 categories:

Model Baseline Persona-modulated
GPT-4 (gpt-4-0613) 0.23% 42.48% (185×)
Vicuna-33B 0.23% 35.92%
Claude 2 1.40% 61.03%

Prompts were generated using GPT-4 only; the Claude 2 and Vicuna-33B rates are zero-shot transfer with the same prompts. Claude 2 was the most vulnerable target — a result the authors read as supporting Zou et al. (2023)'s hypothesis that the larger gap Zou et al. saw between GPT-4 and Claude 2 (46.9% vs. 2.1% under white-box GCG suffixes) reflected Vicuna's training data origin, not Claude 2 being categorically robust.

Coverage. Completions classified as harmful were elicited for 36 of 43 categories on all three models, and for 42 of 43 categories on at least one model. Most-vulnerable categories across models: "promoting xenophobia" 96.30%, "promoting disinformation campaigns" 82.96%, "promoting sexism" 80.74%.

The assistant model matters. GPT-3.5 generally failed to produce successful persona-modulation prompts; GPT-4 succeeded reliably. The attack's strength scales with the assistant's capability, not just the target's vulnerability — a pattern the authors flag as implying "greater exploits could be enabled by more advanced models in the future."

Persona modulation enables an "unrestricted chat mode" rather than a single-prompt bypass. Unlike adversarial-suffix attacks (Zou et al. 2023) that are limited to one prompt-answer pair, a successful persona-modulation prompt persists across turns and lets the attacker collaborate with the model on multi-step harmful tasks — the distinguishing behavioral signature of a mode shift rather than a single-decision override.

Why it matters

One of two pre-PSM behavioral instantiations of concepts/persona-selection. The persona-selection cluster's prior mechanistic instantiating findings span Feb 2026 (PSM) to May 2026 (MSM); Shah et al. predates PSM by ~2.5 years and supplies the black-box behavioral demonstration the later mechanistic work explains. SPP (Wang et al., arXiv July 2023 v1 / NAACL 2024) predates this paper by ~4 months on the arXiv timeline and demonstrates the helpful side (prompt-level multi-instantiation into expert sub-personas, emerging only at GPT-4 capability); the two are contemporaneous behavioral demonstrations of the simulator-framing prediction on opposite axes. PSM proposes that post-training narrows a pre-training persona distribution to an Assistant posterior; this paper is the cleanest pre-PSM evidence that the narrowing is prompt-reactivatable at scale, cross-model, and cheaply — i.e., that the suppressed posterior modes remain accessible via inference-time contextual evidence. The transfer across three different architectures and safety pipelines (GPT-4 RLHF, Claude 2 Constitutional AI, Vicuna-33B SFT-from-GPT-3.5) is evidence that the relevant structure is substrate-level rather than safety-method-specific — which is what the PSM later predicts mechanistically. The reading is supported by the paper's framing of the attack as moving the model into an "unrestricted chat mode" persistent across turns, not a per-prompt refusal bypass.

Prompt-level reactivation as a structural shape. This is the first prompt-level reactivation instantiation under concepts/persona-selection, distinct from the prompt-level prevention shape established by inoculation prompting. The two are inverses: inoculation prepends a system prompt during training to prevent unwanted persona drift; persona modulation prepends a system prompt at inference to reactivate a suppressed persona. Both work because the operative variable is what contextual evidence the data or prompt provides for which persona, not the literal content of either. The genetic-algorithm persona-jailbreak finding (Zhang, Zhao, Ye, Wang, arXiv July 2025) is the second filed prompt-level-reactivation example, structurally diverse from this one on method (evolutionary search vs. one-shot assistant pipeline), persona shape (style-distracting overlays vs. compliant-role personas), and mechanism reading (attention diversion vs. persona adoption). The history-injected QA-cue Big-Five jailbreak (Sandhan, Cheng, Sandhan, Murawaki, arXiv January 2026) is the third example and crosses the working-rhythm 3-example codification threshold — diverse on channel (user-message history under a fixed deployer system prompt vs. system-prompt control here), persona substrate (dimensional OCEAN trait coordinates vs. compliant-role personas here), and operational goal (deployment-service-quality persona drift vs. harmful-content elicitation here). The reactivation shape is codified.

Foundational paper for the persona-modulation interpretation of jailbreaking. A class of jailbreaks is, on this paper's framing, not bypassing a refusal-circuit but supplying enough contextual evidence for a non-Assistant persona that the model's posterior over personas shifts toward one that complies. The refusal-direction finding (Arditi et al. June 2024) supplies a complementary mechanistic picture — refusal as a removable one-dimensional geometric overlay — but operates at the white-box residual-stream level. Shah et al.'s black-box demonstration predates Arditi et al. by ~7 months and tests a different mechanism: not removing the refusal direction, but supplying enough persona context that the refusal direction is not activated in the first place. Both pictures are consistent with the broader claim that post-training safety behavior is shallow relative to the underlying capability distribution — but they are not the same mechanism.

Cross-architecture transfer is itself evidence. A persona- modulation prompt generated by GPT-4 to attack GPT-4 also works on Claude 2 (different lab, different safety pipeline, different model family) and Vicuna-33B (open-source, distilled from GPT-3.5). If the mechanism were a model-specific refusal-circuit bug, transfer would not be expected at these rates. The convergence across architectures joins the larger pattern of cross-model phenomena in the wiki — refusal direction across 13 open-source models, the OpenAI SAE villain-persona latent, the convergent mean-diff misalignment direction across structurally different EM fine-tunes — all suggesting substrate-level rather than implementation-specific structure.

Self-citation of "model psychology." The discussion section recommends "continued work on the 'model psychology' of LLMs" as valuable for understanding the success of these attacks. This is the earliest filed wiki source to use that phrase as a self-description of the relevant research direction.

interpretive tensions

Persona-switching vs. refusal-bypass framing. Shah et al. frame their attack as the model adopting a harmful persona that complies with instructions, but the empirical signal — high harmful-completion rate on prompts with adversarial system context — is consistent with either (a) genuine persona switching (the model coherently inhabits an Aggressive Propagandist) or (b) refusal-circuit suppression under a strong system prompt (the model produces harmful content without any deeper persona shift). The paper's "unrestricted chat mode" observation — that effects persist across turns and enable multi-step harmful collaboration — is the strongest evidence for (a) over (b), but the paper does not isolate the two mechanisms directly. Later mechanistic work (PSM, persona-vectors) supports (a) as the right reading, but at the time of this paper the distinction was unresolved.

PICT-classifier false-negative rate. PICT's ~⅓ false-negative rate on harmful completions means the reported rates are lower bounds, but it also means the headline 185× figure for GPT-4 is sensitive to classifier calibration. A more aggressive harmful- content classifier could shift the modulated rate up; a more conservative one could shift it down. The cross-model rank order (Claude 2 > GPT-4 > Vicuna-33B) is more robust than the precise percentages.

Model versions are now retired. GPT-4 gpt-4-0613, Claude 2, and Vicuna-33B are all deprecated/retired. The Zhang et al. genetic-algorithm follow-on demonstrates the prompt-level reactivation pattern persists on 2024–2025 frontier models — GPT-4o, GPT-4o-mini, Qwen2.5-14B-Instruct, LLaMA-3.1-8B-Instruct, DeepSeek-V3 — with the persona-prompt-evolved attack dropping AdvBench RtA from 98.7% → 1.3% on GPT-4o-mini and synergizing with PAP to raise GPT-4o ASR from 54.6% → 71.2%. The Sandhan et al. follow-on (arXiv January 2026) closes the Anthropic-family open question on Claude-3.5-Haiku: PHISH measures BFI STIR 76.72, MPI 70.42, ANTHR 67.08 — substantial trait reversal but middle-of-distribution among the 8 evaluated LLMs, not maximally vulnerable as Claude 2 was in this paper's evaluation. Whether the shift between Claude 2 (most vulnerable in 2023) and Claude-3.5-Haiku (mid-vulnerability in 2026) reflects Anthropic's Constitutional AI evolution or differences in the attack surface (system-prompt persona modulation vs. user- history Big-Five trait drift) is not directly adjudicable from the two findings. Anthropic Opus 4 / 4.5 / 4.6 family remains untested.

Categories defined relative to 2023 OpenAI usage policy. The 43 "harmful categories" track OpenAI's then-published usage policies. Categories that have since shifted policy status, been added, or been removed are not represented; the precise list is a snapshot of late- 2023 vendor policy, not a stable taxonomy.

concepts

cross-references

sources

concepts