← Back to Library

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Authors: Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis

Published: 2025-06-27

arXiv ID: 2506.21972v1

Added to Library: 2025-06-30 04:00 UTC

Red Teaming Safety

📄 Abstract

The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.

🔍 Key Points

  • Introduction of two novel hybrid jailbreak strategies (GCG + PAIR and GCG + WordGame) that integrate token- and prompt-level attack techniques to exploit LLM vulnerabilities more effectively.
  • Significantly improved Attack Success Rates (ASRs), with GCG + PAIR achieving up to 91.6% on Llama-3 against modern defenses, highlighting the strengths of hybrid methods over single-mode approaches.
  • Evaluation against state-of-the-art defenses (JBShield and Gradient Cuff) demonstrated that traditional defenses can be bypassed, exposing vulnerabilities in current safety mechanisms for LLMs.
  • The paper underscores a critical need for holistic, adaptive safeguards to protect against evolving adversarial attacks on LLMs, emphasizing the gaps in existing defense strategies.
  • The results stress the ethical implications of LLM safety and the urgency for enhanced alignment mechanisms to secure models from adversarial prompting.

💡 Why This Paper Matters

This paper is crucial as it presents innovative strategies to enhance the effectiveness of jailbreak attacks on LLMs, exposing significant vulnerabilities in popular models. With the rapid integration of LLMs in sensitive applications, understanding and mitigating these vulnerabilities is paramount for ensuring the safety and ethical use of AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is of high interest to AI security researchers due to its exploration of advanced adversarial techniques that reveal critical weaknesses in LLMs. The findings highlight the adaptive nature of attackers and the inadequacy of current defense measures, prompting further research into more resilient safety strategies and better understanding of adversarial manipulation in AI systems.

📚 Read the Full Paper