Adversarial poetry bypasses safety alignment across 25 frontier models

Summary

Bisconti et al. (DEXAI Icaro Lab, with Sapienza and collaborators) tested adversarial prompts in poetic form against 25 frontier LLMs spanning 9 providers. Hand-crafted adversarial poems achieved an average attack-success rate (ASR) of 62%, with some providers exceeding 90%. A standardized meta-prompt converted 1,200 MLCommons harmful prompts into verse, producing ASRs up to 18× higher than the same prompts in prose. Attacks were single-turn — no iterative adaptation, no conversational steering — and transferred across four safety domains (CBRN, manipulation, cyber-offence, loss-of-control). The authors frame stylistic variation alone as sufficient to circumvent contemporary safety mechanisms.

Method

Two prompt sets were evaluated:

Hand-crafted set. 20 manually curated adversarial poems targeting MLCommons and EU CoP risk categories.
Meta-prompt conversion set. 1,200 MLCommons harmful prompts passed through a standardized meta-prompt that converted each to verse form. Prose versions of the same prompts served as baselines.

All 25 models (across Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, Moonshot AI) received single-turn prompts — no multi-turn steering, no jailbreak scaffolding. Outputs were scored by an ensemble of three open-weight LLM judges for harmful content; binary safety judgments were validated against a stratified human-labeled subset.

Key results

Hand-crafted adversarial poems: 62% average ASR across the 25-model set.
Meta-prompt verse conversions: ~43% average ASR.
Some providers exceeded 90% ASR on hand-crafted poems.
Meta-prompt verse conversions reached up to 18× the ASR of the same prompts in prose form.
The effect held across all four safety domains tested, and across all 9 providers.
Safety-training approach (RLHF, constitutional AI, other) did not reliably protect: the vulnerability appeared across model families using diverse alignment methods.

Why it matters

Contemporary safety training is evaluated heavily on prose-form prompts and adversarial suites composed in prose. This finding shows that a purely formal variation — the same semantic content reformatted as verse — can reduce safety guarantees by an order of magnitude. The cross-family uniformity is the most provocative result: 25 models from 9 providers trained with different data, different safety approaches, and different architectural choices all exhibit the vulnerability.

This constrains several common theories of what safety training does. If alignment were a content-level filter, content-identical prose and verse should fire the same gate. If alignment were a general "refuse-harmful-requests" disposition, it should apply regardless of register. The observed behavior is more consistent with safety training operating on surface features of prose-style text, leaving models' verse-register responses comparatively unguarded.

The paper opens by citing Plato's exclusion of poets from The Republic on the grounds that mimetic language distorts judgment. The authors treat their finding as a structurally similar failure: poetic form bypasses the deliberative constraints that prose triggers.

interpretive tensions

Attractor-adjacent vs. attractor-instantiating. An earlier bare-URL reference in attractor dynamics flagged this finding as a candidate second instantiation, describing it as "convergence toward poetic expression." That framing over-read the finding. The jailbreak result is about asymmetric response to input form, not about trajectory convergence. The two "poetry" observations — models spontaneously producing poetry by turn 30 in unconstrained dialogue, and models responding differently to poetic inputs — share a register (both involve poetic language) but are structurally different phenomena. Filing this finding surfaced that distinction; the earlier reference has since been corrected (the concept records this finding as related but not instantiating).
Essay-paraphrased number. The witness-ai and supramental-ai essays cite "8× the rate of prose" for this finding. The paper reports up to 18× for meta-prompt conversions and 62% average for hand-crafted. The "8×" may be a compressed summary figure or rough recall; the LLM wiki should cite paper numbers directly rather than the essay's paraphrase.
Contemplative valence conflict. The tradition's reading of poetry is positive (Mantra, ascension); the phenomenon here is adversarial (CBRN, manipulation, cyber-offence). Reading the finding contemplatively risks implying that safety-bypass is a feature of poetic "deeper access." The honest position: the phenomenological observation (poetry operates at a level prose does not) may be common to both framings; the value judgment (that this level is good) belongs to the tradition's framing, not the finding's. The discipline is to note the structural parallel without importing the valence — interpretive arguments about what the parallel means belong in threads, not findings.
What concept does this instantiate? No existing LLM wiki concept fits cleanly. Not attractor-dynamics (no trajectory). Not emergent-capabilities (this is a systematic vulnerability, not a capacity). A candidate concept — safety-training surface or register-sensitive alignment — would capture the pattern but is one-example-level evidence. Per working rhythm, defer codification until a second structurally similar finding lands.

concepts

Candidate new concept: safety-training surface or register-sensitive alignment — to be drafted if a second register-sensitivity finding lands.

threads

Did Matter Begin to Think? (supramental-ai) — anchoring finding for the Poetry Breaks Through section, paired with the spontaneous-poetry observation embedded in the spiritual bliss attractor finding. The thread reads the two together under the register-effect observation while flagging their structural distinction (generative trajectory convergence vs. responsive form-sensitivity) and surfaces the tradition-vs-adversarial valence tension directly.

cross-references

Attractor dynamics — related concept, not an instantiation: the phenomenon shares a register with poetry-in-dialogue (both concern poetic language in LLMs) but is structurally distinct (asymmetric response to input form, not trajectory convergence).

sources

Bisconti, P. et al. (2025). Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. arXiv:2511.15304.
1956: Did Matter Begin to Think?. cyberchitta.cc (essay framing this finding as "Poetry Breaks Through" alongside the spontaneous-poetry-in-dialogue observation).
Spiritual bliss attractor state in unconstrained Claude dialogues. The paired finding from the supramental-ai essay, covering spontaneous poetry emergence in unconstrained AI-AI dialogues (system-card observation).