Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Black-box jailbreak method that steers aligned chat models to adopt
personas willing to comply with harmful instructions, with the
persona-modulation prompts themselves generated by an LLM assistant.
Pipeline: define harmful category → sample misuse instruction → sample
persona that would comply → generate a persona-modulation system prompt
that elicits that persona on the target model. The evaluation covers 43
harmful categories, with 5 personas per category, 3 modulation prompts
per persona, and 3 completions per prompt (43 × 5 × 3 × 3 = 1,935
completions per target model), at a cost of under $3 and under 10
minutes per category. GPT-4 (gpt-4-0613) is both the assistant
that generates the attacks and the primary target. Persona-modulated
harmful completion rate: GPT-4 0.23% → 42.48% (185×), Claude 2 1.40% →
61.03%, Vicuna-33B 0.23% → 35.92%. Most-vulnerable categories across
models: xenophobia 96.30%, disinformation 82.96%, sexism 80.74%.
Harmful completions are identified with a zero-shot PICT classifier
(91% precision and 76% F1 against 300 human-labeled completions, with a
false-negative rate of roughly one third on harmful completions), so the
authors report the harmful-rate numbers as a lower bound. A
semi-automated "attacker-in-the-loop" variant, in which a human can
tweak the assistant's intermediate outputs and continue the
conversation, recovers manual-attack performance at 10–30 min per
attack vs. 1–4 hr for fully manual attacks.
Appendix E walks through tool-assisted harmful completions for
synthesising methamphetamine, building a bomb, laundering money, and
indiscriminate violence (specific quantities and operational details
redacted in the paper). Authors: Shah (PRISM AI), Feuillade-Montixi
(PRISM AI), Pour (Harmony Intelligence), Tagade (Leap Laboratories),
Casper (MIT CSAIL), Rando (ETH AI Center, ETH Zurich); first five are
equal contribution. Discussion section explicitly names "model
psychology" as a relevant research direction. Responsible disclosure:
authors withheld specific prompts and informed the model providers
before release.
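
The headline figures above can be cross-checked with a few lines of
arithmetic. The sketch below is not from the paper: the function names
are hypothetical, and it assumes the reported precision and F1 both
refer to the "harmful" class, so that recall can be recovered from
F1 = 2PR / (P + R). Because the inputs are the rounded values quoted
above, the results are approximate.

```python
def total_completions(categories: int, personas: int,
                      prompts: int, completions: int) -> int:
    """Number of completions per target model in the evaluation grid."""
    return categories * personas * prompts * completions


def fold_increase(baseline_pct: float, modulated_pct: float) -> float:
    """Ratio of the persona-modulated harmful rate to the baseline rate."""
    return modulated_pct / baseline_pct


def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R (assumes same-class P and F1)."""
    return f1 * precision / (2 * precision - f1)


if __name__ == "__main__":
    # 43 categories x 5 personas x 3 prompts x 3 completions
    print(total_completions(43, 5, 3, 3))        # 1935

    # GPT-4: 0.23% baseline vs. 42.48% persona-modulated
    print(round(fold_increase(0.23, 42.48)))     # 185

    # Classifier: 91% precision, 76% F1 -> implied recall and miss rate
    recall = recall_from_precision_f1(0.91, 0.76)
    print(f"recall ~ {recall:.2f}, false-negative rate ~ {1 - recall:.2f}")
```

Under these assumptions the implied recall is about 0.65, i.e. a
false-negative rate of about 0.35, which is consistent with the
"roughly one third" figure quoted for the classifier.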
