← Back to Library

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Authors: Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, Junhua Liu

Published: 2025-11-17

arXiv ID: 2511.13548v1

Added to Library: 2025-11-18 05:00 UTC

Red Teaming

📄 Abstract

The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textit{character, word, and sentence-level} operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.

🔍 Key Points

  • ForgeDAN introduces a novel evolutionary framework that enhances jailbreak attack diversity through multi-strategy text perturbations across character, word, and sentence levels.
  • It employs a semantic fitness measurement mechanism to guide the evolutionary process using interpretable semantic similarity metrics, significantly improving the reliability of adversarial prompt generation.
  • The framework integrates a dual-dimensional jailbreak judgment system that assesses both model compliance and output harmfulness, thereby reducing false positives and enhancing detection effectiveness.
  • Extensive evaluations demonstrate that ForgeDAN achieves higher jailbreaking success rates and greater naturalness and stealth compared to existing state-of-the-art solutions, making it a robust tool for probing LLM safety.
  • The proposed system lays groundwork for developing more resilient and secure LLMs against adversarial prompts, highlighting the need for advanced methodologies in AI safety research.

💡 Why This Paper Matters

This paper presents ForgeDAN as a significant advancement in the field of adversarial attacks on aligned large language models. By addressing key limitations of previous methods, such as low diversity and ineffective detection, it introduces a more sophisticated and effective approach to generating adversarial prompts. The implications of this research are profound, as it not only enhances our understanding of model vulnerabilities but also informs the development of improved safety mechanisms for AI systems, ensuring a balance between capability and security.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it provides a comprehensive analysis of current vulnerabilities in aligned large language models (LLMs) and presents innovative solutions to overcome these challenges. The findings emphasize the importance of robust adversarial strategies and detection mechanisms in ensuring the safety of generative AI applications. As such, ForgeDAN offers insights and methodologies that can inform future research on defensive measures and the ongoing development of secure AI systems.

📚 Read the Full Paper