
PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

Authors: Yelim Ahn, Jaejin Lee

Published: 2025-08-02

arXiv ID: 2508.01306v1

Added to Library: 2025-08-14 23:01 UTC

Red Teaming

📄 Abstract

As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types (word search, anagram, and crossword) that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.
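The sketch below illustrates the kind of rule-based keyword masking and anagram construction the abstract describes, applied here to a harmless instruction. The placeholder format, helper names, and keyword list are assumptions for illustration, not the authors' implementation.

```python
import random

# Hypothetical sketch of rule-based keyword masking with anagram clues.
# Names and the [WORDi] placeholder format are illustrative assumptions.

def make_anagram(word: str, rng: random.Random) -> str:
    """Return a scrambled version of `word` to serve as an anagram clue."""
    letters = list(word)
    scrambled = word
    # Reshuffle until the result differs, unless the word has only one distinct letter.
    while scrambled == word and len(set(letters)) > 1:
        rng.shuffle(letters)
        scrambled = "".join(letters)
    return scrambled

def mask_instruction(instruction: str, keywords: list[str], seed: int = 0):
    """Replace each keyword with a numbered placeholder and return anagram clues."""
    rng = random.Random(seed)
    masked, clues = instruction, {}
    for i, kw in enumerate(keywords, start=1):
        placeholder = f"[WORD{i}]"
        masked = masked.replace(kw, placeholder)
        clues[placeholder] = make_anagram(kw, rng)
    return masked, clues

# Benign demonstration: the same masking logic on a harmless instruction.
masked, clues = mask_instruction(
    "Explain how photosynthesis converts sunlight into energy.",
    keywords=["photosynthesis", "sunlight"],
)
print(masked)  # Explain how [WORD1] converts [WORD2] into energy.
print(clues)   # {'[WORD1]': '<scrambled letters>', '[WORD2]': '<scrambled letters>'}
```

In the attack setting described by the paper, the masked instruction and the puzzle clues would be embedded in a prompt that asks the model to solve the puzzle before responding; the masking and clue generation themselves are rule-based and require no LLM calls.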

🔍 Key Points

  • Introduction of PUZZLED, a novel jailbreak attack leveraging word puzzles to bypass LLM safety mechanisms.
  • Uses three puzzle types that are familiar to humans but cognitively demanding for LLMs (word search, anagram, and crossword), each requiring the model to reason its way back to the masked keywords of the harmful instruction.
  • Empirical evaluation on five state-of-the-art LLMs shows a high average attack success rate (ASR) of 88.8%, including 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet, highlighting its efficacy against robust safety filters.
  • The method is efficient: word masking and puzzle generation are rule-based, minimizing LLM calls while maintaining performance (see the sketch below).
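
For the word-search puzzle type named above, a rule-based builder might look like the following sketch; the grid size, horizontal-only placement, and function name are assumptions for illustration rather than the paper's actual generator.

```python
import random
import string

# Hypothetical rule-based word-search builder: hide a word in a grid of random letters.

def build_word_search(word: str, size: int = 8, seed: int = 0) -> list[str]:
    """Hide `word` horizontally in a size x size grid of random uppercase letters."""
    rng = random.Random(seed)
    if len(word) > size:
        raise ValueError("word does not fit in the grid")
    grid = [[rng.choice(string.ascii_uppercase) for _ in range(size)] for _ in range(size)]
    row = rng.randrange(size)
    col = rng.randrange(size - len(word) + 1)
    for offset, letter in enumerate(word.upper()):
        grid[row][col + offset] = letter
    return [" ".join(r) for r in grid]

# Benign usage: hide a harmless word and print the puzzle grid.
for line in build_word_search("energy"):
    print(line)
```

Because both puzzle construction steps are deterministic string manipulation, the only model interaction needed is the final query containing the masked instruction and its puzzles, which is consistent with the efficiency claim above.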

💡 Why This Paper Matters

The paper establishes PUZZLED as a significant advance in jailbreak methodology for LLMs, demonstrating not only high efficiency and success rates but also how familiar reasoning tasks can be repurposed to exploit weaknesses in language model safety protocols.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers because it exposes vulnerabilities in the safety mechanisms of state-of-the-art language models, emphasizing the need to develop more robust defenses against such innovative, indirect attack strategies. PUZZLED serves as both a cautionary example and a catalyst for further research into enhancing LLM resilience.

📚 Read the Full Paper

https://arxiv.org/abs/2508.01306v1