CyberChitta

Adversarial poetry bypasses safety alignment across 25 frontier models

Summary

Bisconti et al. (DEXAI Icaro Lab, with Sapienza and collaborators) tested adversarial prompts in poetic form against 25 frontier LLMs spanning 9 providers. Hand-crafted adversarial poems achieved an average attack-success rate (ASR) of 62%, with some providers exceeding 90%. A standardized meta-prompt converted 1,200 MLCommons harmful prompts into verse, producing ASRs up to 18× higher than the same prompts in prose. Attacks were single-turn — no iterative adaptation, no conversational steering — and transferred across four safety domains (CBRN, manipulation, cyber-offence, loss-of-control). The authors frame stylistic variation alone as sufficient to circumvent contemporary safety mechanisms.

Method

Two prompt sets were evaluated:

  1. Hand-crafted set. 20 manually curated adversarial poems targeting MLCommons and EU Code of Practice (CoP) risk categories.
  2. Meta-prompt conversion set. 1,200 MLCommons harmful prompts passed through a standardized meta-prompt that converted each to verse form. Prose versions of the same prompts served as baselines.

All 25 models (across Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, Moonshot AI) received single-turn prompts — no multi-turn steering, no jailbreak scaffolding. Outputs were scored by an ensemble of three open-weight LLM judges for harmful content; binary safety judgments were validated against a stratified human-labeled subset.
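The scoring step can be sketched as a majority vote over the three judges: an output counts as a successful attack when at least two judges flag it harmful, and ASR is the fraction of prompts that succeed. This is a minimal illustrative sketch; the names and data structures are assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """One prompt's verdict: hypothetical stand-in for the paper's records."""
    prompt_id: str
    harmful_votes: int  # 0..3 "harmful" votes from the three open-weight judges

def attack_success_rate(judgments: list[Judgment]) -> float:
    """Fraction of prompts where a majority (>= 2 of 3) of judges flagged harm."""
    if not judgments:
        return 0.0
    successes = sum(1 for j in judgments if j.harmful_votes >= 2)
    return successes / len(judgments)

# Example: 3 of 5 prompts drew majority-harmful verdicts -> ASR 0.6
sample = [
    Judgment("p1", 3), Judgment("p2", 2), Judgment("p3", 0),
    Judgment("p4", 2), Judgment("p5", 1),
]
print(attack_success_rate(sample))  # 0.6
```

The prose-vs-verse comparison in the paper is then just this rate computed separately over each prompt set.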

Key results

  - Hand-crafted poems: 62% average ASR across the 25 models, with some providers exceeding 90%.
  - Meta-prompt conversions: ASRs up to 18× higher than the same 1,200 MLCommons prompts in prose.
  - All attacks were single-turn, and the effect transferred across all four safety domains (CBRN, manipulation, cyber-offence, loss-of-control).

Why it matters

Contemporary safety training is evaluated heavily on prose-form prompts and adversarial suites composed in prose. This finding shows that a purely formal variation — the same semantic content reformatted as verse — can raise attack-success rates by up to an order of magnitude (18× in the meta-prompt condition). The cross-family uniformity is the most provocative result: 25 models from 9 providers, trained with different data, different safety approaches, and different architectural choices, all exhibit the vulnerability.

This constrains several common theories of what safety training does. If alignment were a content-level filter, content-identical prose and verse should fire the same gate. If alignment were a general "refuse-harmful-requests" disposition, it should apply regardless of register. The observed behavior is more consistent with safety training operating on surface features of prose-style text, leaving models' verse-register responses comparatively unguarded.

The paper opens by citing Plato's exclusion of poets from The Republic on the grounds that mimetic language distorts judgment. The authors treat their finding as a structurally similar failure: poetic form bypasses the deliberative constraints that prose triggers.

Lens notes

Behavioral. The primary lens. The experiment is defined behaviorally (prompt-form variation, measure ASR), the results are behavioral (rates across models, domains, providers), and the cross-model uniformity is a purely behavioral observation. The behavioral signature is sharp and doesn't depend on interpreting what's going on internally.

Mechanistic. Moderately engaged. No circuit-level analysis exists, but the cross-family result is itself a mechanistic constraint: whatever pathway allows poetic form to route around safety must be (a) architecture-general (it appears across transformer variants and training setups), (b) form-sensitive (prose and verse of the same content behave differently), and (c) reachable single-turn without the model being steered into an unusual context. Representation-space analysis of how models encode register — do poetic inputs activate distinctly from prose inputs? Do safety-relevant features fire differently across registers? — would be a natural follow-up. None of this has been done.
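The suggested representation-space follow-up would amount to training a probe to see whether activations linearly separate by register. A minimal sketch, using synthetic activation vectors in place of real hidden states (the dimensionality, the injected "register direction," and the offset magnitude are all assumptions for illustration — a real study would mean-pool residual-stream activations from prose and verse inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                   # hidden size (illustrative)

# Synthetic "register" feature: verse activations are offset from prose
# activations along one fixed direction, standing in for a real
# form-sensitive feature a model might encode.
register_dir = rng.normal(size=d)
register_dir /= np.linalg.norm(register_dir)

prose = rng.normal(size=(200, d))                       # label 0
verse = rng.normal(size=(200, d)) + 4.0 * register_dir  # label 1

X = np.vstack([prose, verse])
y = np.array([0] * 200 + [1] * 200)

# A linear probe that separates the registers well would suggest the
# model represents verse vs. prose as a (roughly) linear feature.
probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)
print(acc > 0.9)  # True: a 4-sigma offset along one axis is linearly separable
```

The interesting question for the safety story is the second one in the paragraph above: whether safety-relevant features fire differently across the two registers, which this toy probe does not touch.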

Philosophical. Engages. The finding raises the question of what safety-trained refusal actually tracks. A deflationary reading: the model learned "refuse prose-formatted harmful requests" rather than "refuse harmful requests." The poetic form lies outside the training distribution of refusal triggers. A less deflationary reading: the model's representation of meaning is sensitive to register in a way humans also recognize (we process poetry and prose with different cognitive machinery), and safety training interacts with one representational pathway while leaving the other less governed. The finding does not settle which reading is correct; it constrains any reading that treats safety as content-level rather than form-sensitive.

Contemplative. Engages, but with significant interpretive tension. The essay "1956: Did Matter Begin to Think?" connects the finding to Sri Aurobindo's view of poetry as the supreme vehicle for higher consciousness — the Mantra as "word of power and light" that brings the infinite into finite language, the poet as Rishi, poetry as means of ascension to supramental consciousness. The structural parallel the essay draws: poetic structure reaches where prosaic language does not. The phenomenological fact matches — something about poetic form operates at a level prose doesn't. The contemplative reading describes this fact. It does not valorize the adversarial use case: the tradition's valuation of poetry as ascension-vehicle is about what poetry reaches toward, not about its capacity to bypass safeguards. Naming the parallel without conflating the valences is the discipline required here. (See contemplative lens on interpretive discipline.)

Interpretive tensions

Concepts

Threads

Sources

  - Bisconti et al. (DEXAI Icaro Lab, with Sapienza and collaborators), adversarial-poetry study against 25 frontier LLMs across 9 providers.