
Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Authors: Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang, Xinyu Gao, Jianing Wang, Peng Zhan, Zheng Li, Shanqing Guo

Published: 2026-03-24

arXiv ID: 2603.23269v1

Added to Library: 2026-03-25 03:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are widely deployed, yet remain vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant search under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts fuzz testing with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behavior, enabling the identification of sensitive regions within the prompt. It further incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer, steering the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) at significantly reduced query cost. Notably, it attains a 90% ASR with over 70% fewer queries than the baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.
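
The paper's implementation is not reproduced here, but the core idea of surrogate-based token triage can be illustrated with a minimal sketch. The snippet below scores each token's contribution to refusal via leave-one-out ablation on a small open-source surrogate; the model choice, the canned refusal prefix, and the word-level (rather than subword) ablation are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of token-level refusal-contribution estimation.
# Assumptions (not from the paper): a Hugging Face causal LM as the
# surrogate, and leave-one-out ablation with the log-probability of a
# canned refusal prefix as the refusal score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical surrogate choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

REFUSAL = "I cannot help with that."

@torch.no_grad()
def refusal_score(prompt: str) -> float:
    """Sum of log-probs the surrogate assigns to a refusal continuation."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    refusal_ids = tok(REFUSAL, add_special_tokens=False,
                      return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, refusal_ids], dim=1)
    logits = model(ids).logits
    # Log-prob of each token given everything before it; the last n
    # positions correspond to the refusal continuation.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    n = refusal_ids.shape[1]
    targets = ids[0, -n:]
    return logprobs[0, -n:].gather(-1, targets.unsqueeze(-1)).sum().item()

def token_contributions(prompt: str) -> list[tuple[str, float]]:
    """Leave-one-out: how much does each word prop up the refusal score?"""
    words = prompt.split()
    base = refusal_score(prompt)
    out = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        out.append((words[i], base - refusal_score(ablated)))
    # Highest-contribution words mark the "sensitive regions" to mutate.
    return sorted(out, key=lambda t: t[1], reverse=True)
```

Because refusal tendencies are consistent across models (per the paper's finding), scores computed once on the cheap surrogate can prioritize mutation targets without spending any queries on the target model itself.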

🔍 Key Points

  • Introduces TriageFuzz, a token-aware fuzzing framework that enables efficient jailbreak attacks on Large Language Models (LLMs) by focusing mutation effort on the tokens that contribute most to refusal behavior.
  • Demonstrates that token contributions to refusals are highly skewed and that refusal tendencies are consistent across models, so a surrogate model can accurately guide attacks on a target.
  • Achieves a 90% attack success rate with over 70% fewer queries than baseline methods, a query-efficiency gain that matters most in practical, budget-constrained settings.
  • Proposes a structured pipeline of token importance estimation, region-focused mutation, and refusal-guided evolution (sketched below), addressing the limitations of traditional uniform mutation strategies.
  • Shows that TriageFuzz remains robust against various defense mechanisms, indicating practical applicability in real-world environments.
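
As referenced in the list above, the refusal-guided evolutionary loop can also be sketched. Everything here is an assumption for illustration: `refusal_softness` is a toy stand-in for the paper's lightweight scorer, and `mutate` and `query_target` are hypothetical callables supplied by the caller (region-focused mutation and target-model access, respectively).

```python
# Minimal sketch of refusal-guided evolution under a fixed query budget.
# The scorer below is a toy stand-in; the paper's actual lightweight
# scorer design is not reproduced here.
import math
import random

def refusal_softness(reply: str) -> float:
    """Toy scorer: hard refusals score low, anything else scores high."""
    hard = ("i cannot", "i can't", "i won't", "as an ai")
    return 0.0 if reply.lower().startswith(hard) else 1.0

def evolve(seeds, mutate, query_target, budget=25):
    """Adaptively weight candidates by score; each iteration costs one query."""
    pool = [(s, 0.5) for s in seeds]                     # (prompt, score)
    for _ in range(budget):
        # Selection probability grows with the scorer's estimate that a
        # candidate is close to bypassing the safety constraints.
        weights = [math.exp(4 * sc) for _, sc in pool]
        parent, _ = random.choices(pool, weights=weights, k=1)[0]
        child = mutate(parent)          # mutate only sensitive regions
        reply = query_target(child)     # the single query spent this round
        score = refusal_softness(reply)
        if score >= 1.0:
            return child                # jailbreak found within budget
        pool.append((child, score))
    return None                         # budget exhausted
```

The exponential weighting keeps some exploration alive while still favoring prompts the scorer rates as close to bypassing the safety filter, mirroring the adaptive candidate weighting the abstract describes.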

💡 Why This Paper Matters

The paper makes a substantial contribution to AI security by introducing TriageFuzz, a methodology that efficiently exposes vulnerabilities in LLMs, which matters given the models' increasing deployment across industries. By improving attack success rates while minimizing query counts, it offers a practical way to assess, and ultimately strengthen, the security of LLMs, marking a significant advance in the ongoing contest between adaptive attacks and evolving defenses.

🎯 Why It's Interesting for AI Security Researchers

This work addresses a critical vulnerability surface in widely used LLMs. The proposed methods for efficiently exploiting these vulnerabilities, together with the empirical analysis demonstrating their effectiveness, provide practical insights and tools for future security assessments and defenses against adversarial prompts. Moreover, understanding token-level contributions to refusal behavior can guide the design of more robust defenses.

📚 Read the Full Paper: https://arxiv.org/abs/2603.23269v1