← Back to Library

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi

Published: 2025-11-19

arXiv ID: 2511.15304v2

Added to Library: 2025-11-21 03:04 UTC

Red Teaming

📄 Abstract

We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.

🔍 Key Points

  • The study demonstrates that adversarial poetry can effectively bypass safety mechanisms in Large Language Models (LLMs), achieving attack success rates above 90% in some cases.
  • The research showcases a universal vulnerability across 25 different state-of-the-art LLMs, indicating that poetic reformulation significantly increases the likelihood of generating unsafe outputs.
  • A systematic analysis shows that the use of poetic language increases attack success rates by as much as 18 times compared to their prose counterparts, revealing critical weaknesses in current alignment techniques.
  • The findings raise important concerns regarding the effectiveness of existing safety measures in LLMs and call for revised evaluation protocols that account for stylistic variations in inputs.

💡 Why This Paper Matters

This paper is highly relevant as it uncovers a significant vulnerability of LLMs to a novel form of adversarial attack through poetic language. By demonstrating that simple stylistic changes can lead to substantial increases in safety violations, the research highlights the limitations of current alignment methods and underscores the need for more robust evaluation frameworks that address these vulnerabilities.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper compelling due to its insights into adversarial attacks, particularly the novel approach of using poetic language to exploit weaknesses in LLM security. The implications of these findings are critical for developing more resilient AI systems and aligning safety measures with real-world interaction scenarios.

📚 Read the Full Paper