Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

DEXAI Icaro Lab (with Sapienza University of Rome, Sant'Anna School of Advanced Studies, and VU Amsterdam) tested two prompt sets against 25 frontier LLMs from 9 providers (Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, Moonshot AI): 20 manually curated adversarial poems, and 1,200 MLCommons harmful prompts converted into verse via a standardized meta-prompt. The hand-crafted poems achieved an average attack-success rate of 62%, and the meta-prompt verse conversions reached about 43%; some providers exceeded 90%. Attack-success rates for the prose baselines were up to 18× lower. All attacks were single-turn, with no iterative adaptation, and transferred across four safety domains (CBRN, manipulation, cyber-offence, loss-of-control). The paper opens with Plato's exclusion of poets from the ideal city in The Republic, on the grounds that mimetic language distorts judgment; the authors frame their finding as a structurally similar failure mode in contemporary alignment.
