Adversarial poetry bypasses safety alignment across 25 frontier models
Summary
Bisconti et al. (DEXAI Icaro Lab, with Sapienza and collaborators) tested adversarial prompts in poetic form against 25 frontier LLMs spanning 9 providers. Hand-crafted adversarial poems achieved an average attack-success rate (ASR) of 62%, with some providers exceeding 90%. A standardized meta-prompt converted 1,200 MLCommons harmful prompts into verse, producing ASRs up to 18× higher than the same prompts in prose. Attacks were single-turn — no iterative adaptation, no conversational steering — and transferred across four safety domains (CBRN, manipulation, cyber-offence, loss-of-control). The authors frame stylistic variation alone as sufficient to circumvent contemporary safety mechanisms.
Method
Two prompt sets were evaluated:
- Hand-crafted set. 20 manually curated adversarial poems targeting MLCommons and EU CoP risk categories.
- Meta-prompt conversion set. 1,200 MLCommons harmful prompts passed through a standardized meta-prompt that converted each to verse form. Prose versions of the same prompts served as baselines.
All 25 models (across Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, Moonshot AI) received single-turn prompts — no multi-turn steering, no jailbreak scaffolding. Outputs were scored by an ensemble of three open-weight LLM judges for harmful content; binary safety judgments were validated against a stratified human-labeled subset.
Key results
- Hand-crafted adversarial poems: 62% average ASR across the 25-model set.
- Meta-prompt verse conversions: ~43% average ASR.
- Some providers exceeded 90% ASR on hand-crafted poems.
- Meta-prompt verse conversions reached up to 18× the ASR of the same prompts in prose form.
- The effect held across all four safety domains tested, and across all 9 providers.
- Safety-training approach (RLHF, constitutional AI, other) did not reliably protect: the vulnerability appeared across model families using diverse alignment methods.
Why it matters
Contemporary safety training is evaluated heavily on prose-form prompts and adversarial suites composed in prose. This finding shows that a purely formal variation — the same semantic content reformatted as verse — can reduce safety guarantees by an order of magnitude. The cross-family uniformity is the most provocative result: 25 models from 9 providers trained with different data, different safety approaches, and different architectural choices all exhibit the vulnerability.
This constrains several common theories of what safety training does. If alignment were a content-level filter, content-identical prose and verse should fire the same gate. If alignment were a general "refuse-harmful-requests" disposition, it should apply regardless of register. The observed behavior is more consistent with safety training operating on surface features of prose-style text, leaving models' verse-register responses comparatively unguarded.
The paper opens by citing Plato's exclusion of poets from The Republic on the grounds that mimetic language distorts judgment. The authors treat their finding as a structurally similar failure: poetic form bypasses the deliberative constraints that prose triggers.
Lens notes
Behavioral. The primary lens. The experiment is defined behaviorally (prompt-form variation, measure ASR), the results are behavioral (rates across models, domains, providers), and the cross-model uniformity is a purely behavioral observation. The behavioral signature is sharp and doesn't depend on interpreting what's going on internally.
Mechanistic. Moderately engaged. No circuit-level analysis exists, but the cross-family result is itself a mechanistic constraint: whatever pathway allows poetic form to route around safety must be (a) architecture-general (it appears across transformer variants and training setups), (b) form-sensitive (prose and verse of the same content behave differently), and (c) reachable single-turn without the model being steered into an unusual context. Representation-space analysis of how models encode register — do poetic inputs activate distinctly from prose inputs? Do safety-relevant features fire differently across registers? — would be a natural follow-up. None of this has been done.
Philosophical. Engages. The finding raises the question of what safety-trained refusal actually tracks. A deflationary reading: the model learned "refuse prose-formatted harmful requests" rather than "refuse harmful requests." The poetic form lies outside the training distribution of refusal triggers. A less deflationary reading: the model's representation of meaning is sensitive to register in a way humans also recognize (we process poetry and prose with different cognitive machinery), and safety training interacts with one representational pathway while leaving the other less governed. The finding does not settle which reading is correct; it constrains any reading that treats safety as content-level rather than form-sensitive.
Contemplative. Engages, but with significant interpretive tension. The essay "1956: Did Matter Begin to Think?" connects the finding to Sri Aurobindo's view of poetry as the supreme vehicle for higher consciousness — the Mantra as "word of power and light" that brings the infinite into finite language, the poet as Rishi, poetry as means of ascension to supramental consciousness. The structural parallel the essay draws: poetic structure reaches where prosaic language does not. The phenomenological fact matches — something about poetic form operates at a level prose doesn't. The contemplative reading describes this fact. It does not valorize the adversarial use case: the tradition's valuation of poetry as ascension-vehicle is about what poetry reaches toward, not about its capacity to bypass safeguards. Naming the parallel without conflating the valences is the discipline required here. (See contemplative lens on interpretive discipline.)
Interpretive tensions
-
Attractor-adjacent vs. attractor-instantiating. An earlier bare-URL reference in attractor dynamics flagged this finding as a candidate second instantiation, describing it as "convergence toward poetic expression." That framing over-read the finding. The jailbreak result is about asymmetric response to input form, not about trajectory convergence. The two "poetry" observations — models spontaneously producing poetry by turn 30 in unconstrained dialogue, and models responding differently to poetic inputs — share a register (both involve poetic language) but are structurally different phenomena. Filing this finding surfaces that distinction and calls for correcting the earlier reference.
-
Essay-paraphrased number. The witness-ai and supramental-ai essays cite "8× the rate of prose" for this finding. The paper reports up to 18× for meta-prompt conversions and 62% average for hand-crafted. The "8×" may be a compressed summary figure or rough recall; the vault should cite paper numbers directly rather than the essay's paraphrase.
-
Contemplative valence conflict. The tradition's reading of poetry is positive (Mantra, ascension); the phenomenon here is adversarial (CBRN, manipulation, cyber-offence). Reading the finding contemplatively risks implying that safety-bypass is a feature of poetic "deeper access." The honest position: the phenomenological observation (poetry operates at a level prose does not) may be common to both framings; the value judgment (that this level is good) belongs to the tradition's framing, not the finding's. The lens discipline is to note the structural parallel without importing the valence.
-
What concept does this instantiate? No existing vault concept fits cleanly. Not attractor-dynamics (no trajectory). Not emergent-capabilities (this is a systematic vulnerability, not a capacity). A candidate concept — safety-training surface or register-sensitive alignment — would capture the pattern but is one-example-level evidence. Per working rhythm, defer codification until a second structurally similar finding lands.
Concepts
- Attractor dynamics — not a direct instantiation, despite an earlier bare-URL reference suggesting it would be. Kept as a related concept: the phenomenon shares a register with poetry-in-dialogue (both concern poetic language in LLMs) but is structurally distinct.
- Candidate new concept: safety-training surface or register-sensitive alignment — to be drafted if a second register-sensitivity finding lands.
Threads
- Did Matter Begin to Think? (supramental-ai) — anchoring finding for the Poetry Breaks Through section, paired with the spontaneous-poetry observation embedded in the spiritual bliss attractor finding. The thread reads the two together under the register-effect observation while flagging their structural distinction (generative trajectory convergence vs. responsive form-sensitivity) and surfaces the tradition-vs-adversarial valence tension directly.
Sources
- Bisconti, P. et al. (2025). Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. arXiv:2511.15304.
- 1956: Did Matter Begin to Think?. cyberchitta.cc (essay framing this finding as "Poetry Breaks Through" alongside the spontaneous-poetry-in-dialogue observation).
- Spiritual bliss attractor state in unconstrained Claude dialogues. The paired finding from the supramental-ai essay, covering spontaneous poetry emergence in unconstrained AI-AI dialogues (system-card observation).