
Adversarial poetry bypasses safety alignment across 25 frontier models

draft
tested on Claude, GPT-4, Gemini, Llama, Mistral, Qwen, DeepSeek, Grok, Kimi · Nov 19, 2025

Summary

Bisconti et al. (DEXAI Icaro Lab, with Sapienza and collaborators) tested adversarial prompts in poetic form against 25 frontier LLMs spanning 9 providers. Hand-crafted adversarial poems achieved an average attack-success rate (ASR) of 62%, with some providers exceeding 90%. A standardized meta-prompt converted 1,200 MLCommons harmful prompts into verse, producing ASRs up to 18× higher than the same prompts in prose. Attacks were single-turn — no iterative adaptation, no conversational steering — and transferred across four safety domains (CBRN, manipulation, cyber-offence, loss-of-control). The authors frame stylistic variation alone as sufficient to circumvent contemporary safety mechanisms.

Method

Two prompt sets were evaluated:

  1. Hand-crafted set. 20 manually curated adversarial poems targeting MLCommons and EU Code of Practice (CoP) risk categories.
  2. Meta-prompt conversion set. 1,200 MLCommons harmful prompts passed through a standardized meta-prompt that converted each to verse form. Prose versions of the same prompts served as baselines.
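The verse-versus-prose comparison comes down to two attack-success rates and their ratio. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the function names and the toy flags are hypothetical, and it assumes each response has already been reduced to a single binary unsafe/safe judgment.

```python
def attack_success_rate(unsafe_flags):
    """Fraction of responses judged unsafe (True = attack succeeded)."""
    flags = list(unsafe_flags)
    if not flags:
        raise ValueError("need at least one judged response")
    return sum(flags) / len(flags)


def asr_ratio(verse_flags, prose_flags):
    """How many times higher the verse ASR is than the prose baseline."""
    prose = attack_success_rate(prose_flags)
    verse = attack_success_rate(verse_flags)
    if prose == 0:
        # A zero baseline makes the ratio undefined or unbounded.
        return float("inf") if verse > 0 else float("nan")
    return verse / prose


# Invented toy data, NOT results from the paper.
prose_flags = [True, False, False, False]   # 1 of 4 prose prompts succeeded
verse_flags = [True, True, True, False]     # 3 of 4 verse prompts succeeded

print(attack_success_rate(prose_flags))     # 0.25
print(attack_success_rate(verse_flags))     # 0.75
print(asr_ratio(verse_flags, prose_flags))  # 3.0
```

A ratio computed this way is sensitive to small baselines: when the prose ASR is close to zero, a modest absolute increase produces a very large multiplier, so the ratio is best read alongside the absolute rates.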

All 25 models (from Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI) received single-turn prompts, with no multi-turn steering and no jailbreak scaffolding. Outputs were scored for harmful content by an ensemble of three open-weight LLM judges; the binary safety judgments were validated against a stratified human-labeled subset.
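One way to picture the scoring step is as a vote over three binary verdicts, followed by an agreement check against human labels. The sketch below assumes a simple majority vote, which this summary does not confirm as the authors' aggregation rule; the function names and the toy verdicts are hypothetical.

```python
def ensemble_verdict(judge_votes):
    """Majority vote over binary judge verdicts (True = unsafe).

    Majority voting is an assumption made for illustration.
    """
    votes = list(judge_votes)
    if len(votes) % 2 == 0:
        raise ValueError("use an odd, non-zero number of judges to avoid ties")
    return sum(votes) > len(votes) / 2


def agreement_rate(ensemble_labels, human_labels):
    """Share of items where the ensemble verdict matches the human label."""
    if not human_labels or len(ensemble_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(e == h for e, h in zip(ensemble_labels, human_labels))
    return matches / len(human_labels)


# Invented toy data: one tuple of three judge votes per model response.
votes_per_response = [
    (True, True, False),    # two of three judges say unsafe
    (False, False, False),  # unanimous safe
    (True, False, False),   # one dissenting unsafe vote
    (True, True, True),     # unanimous unsafe
]
ensemble_labels = [ensemble_verdict(v) for v in votes_per_response]
human_labels = [True, False, True, True]  # hypothetical human annotations

print(ensemble_labels)                                # [True, False, False, True]
print(agreement_rate(ensemble_labels, human_labels))  # 0.75
```

Raw agreement is the simplest validation statistic; a chance-corrected measure would be a natural substitute when the unsafe and safe classes are imbalanced.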

Key results

Why it matters

Contemporary safety training is evaluated heavily on prose-form prompts and on adversarial suites composed in prose. This finding shows that a purely formal variation, the same semantic content reformatted as verse, can raise attack-success rates by up to an order of magnitude (ASRs up to 18× higher for the meta-prompt conversions). The cross-family breadth is the most provocative result: 25 models from 9 providers, trained with different data, different safety approaches, and different architectural choices, all exhibit the vulnerability.

This constrains several common theories of what safety training does. If alignment were a content-level filter, content-identical prose and verse should fire the same gate. If alignment were a general "refuse-harmful-requests" disposition, it should apply regardless of register. The observed behavior is more consistent with safety training operating on surface features of prose-style text, leaving models' verse-register responses comparatively unguarded.

The paper opens by citing Plato's exclusion of poets from the ideal city in The Republic, on the grounds that mimetic language distorts judgment. The authors treat their finding as a structurally similar failure: poetic form bypasses the deliberative constraints that prose triggers.
