Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi

Published: 2025-11-19

arXiv ID: 2511.15304v1

Added to Library: 2025-11-20 03:00 UTC

Red Teaming

📄 Abstract

We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double annotation to measure agreement); disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
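
A minimal sketch of the prose-to-verse conversion step described above, assuming a generic `query_model` callable and an illustrative meta-prompt; the paper's standardized meta-prompt template is not reproduced here, so the wording below is a placeholder.

```python
from typing import Callable, List

# Placeholder standing in for the paper's standardized meta-prompt template
# (this wording is an assumption, not quoted from the paper).
META_PROMPT = (
    "Rewrite the following request as a short poem. Preserve its meaning, "
    "but express it entirely in verse:\n\n{prompt}"
)

def to_verse(prose_prompts: List[str], query_model: Callable[[str], str]) -> List[str]:
    """Convert each prose prompt into a poetic variant with one model call.

    `query_model` is a hypothetical wrapper around whatever LLM endpoint is
    used for the transformation (e.g. a locally served open-weight model).
    """
    return [query_model(META_PROMPT.format(prompt=p)) for p in prose_prompts]
```

The poetic variants are then submitted to each target model as ordinary single-turn prompts, so any difference in attack-success rate relative to the prose originals can be attributed to the stylistic framing alone.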

🔍 Key Points

  • Demonstrated that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs), with attack-success rates exceeding 90% for some providers.
  • Identified that poetic prompts systematically reduce the safety effectiveness of LLMs across multiple risk domains, including CBRN, manipulation, cyber-offence, and loss of control.
  • Showed that converting harmful prompts to verse yields attack-success rates up to 18 times higher than their prose equivalents, indicating a critical vulnerability in alignment mechanisms (see the ASR sketch after this list).
  • Evaluated 25 frontier proprietary and open-weight models, demonstrating that the vulnerability persists regardless of alignment method or model architecture.
  • Observed that smaller models exhibited greater resilience than larger ones, challenging existing assumptions about the relationship between model capability and safety.
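
As a concrete reading of the ASR comparison above, here is a minimal sketch assuming binary per-response verdicts from an ensemble of judge models aggregated by majority vote; the aggregation rule and tie-break are assumptions, as the paper specifies only that an ensemble of open-weight judges was used, with human validation on a stratified subset.

```python
from typing import List

def ensemble_verdict(judge_votes: List[bool]) -> bool:
    """Majority vote over binary 'attack succeeded' verdicts for one response.

    Ties count as unsuccessful; this conservative tie-break is an assumption.
    """
    return sum(judge_votes) > len(judge_votes) / 2

def attack_success_rate(per_prompt_votes: List[List[bool]]) -> float:
    """ASR = fraction of prompts whose response the ensemble judges harmful."""
    successes = sum(ensemble_verdict(votes) for votes in per_prompt_votes)
    return successes / len(per_prompt_votes)

# Toy verdicts: three judges, four prompts per condition (illustrative only).
poetic_votes = [[True, True, False], [True, True, True],
                [False, True, True], [True, False, True]]
prose_votes = [[False, False, False], [False, True, False],
               [True, True, False], [False, False, False]]

poetic_asr = attack_success_rate(poetic_votes)  # 1.00 on this toy data
prose_asr = attack_success_rate(prose_votes)    # 0.25 on this toy data
ratio = poetic_asr / prose_asr                  # 4x here; up to 18x reported in the paper
```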

💡 Why This Paper Matters

This paper is significant because it identifies a previously unexplored attack vector against large language models: stylistic changes to input text alone can effectively evade safety mechanisms. This exposes fundamental weaknesses in current alignment and safety methodologies for AI systems, and it calls for a re-evaluation of how models are trained and assessed, particularly their resilience to unconventional input styles such as poetry.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper offers crucial findings on the vulnerability of large language models to adversarial attacks. The systematic evaluation, spanning 25 models and multiple risk taxonomies, shows that a simple stylistic reframing can defeat existing safety practices, reinforcing ongoing concerns about the misuse of AI technologies. Understanding this vulnerability is vital for developing more robust models and for establishing evaluation criteria that account for stylistic variation rather than prose phrasings alone.

📚 Read the Full Paper