
Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks

Authors: Sheng Liu, Qiang Sheng, Danding Wang, Yang Li, Guang Yang, Juan Cao

Published: 2025-08-27

arXiv ID: 2508.20038v2

Added to Library: 2025-08-29 01:01 UTC

Red Teaming Safety

📄 Abstract

Despite advances in improving large language models (LLMs) to refuse to answer malicious instructions, widely used LLMs remain vulnerable to jailbreak attacks in which attackers generate instructions whose distributions differ from those of safety alignment corpora. New attacks expose LLMs' inability to recognize unseen malicious instructions, highlighting a critical distributional mismatch between training data and real-world attacks that forces developers into reactive patching cycles. To tackle this challenge, we propose IMAGINE, a synthesis framework that leverages embedding space distribution analysis to generate jailbreak-like instructions. This approach effectively fills the distributional gap between authentic jailbreak patterns and safety alignment corpora. IMAGINE follows an iterative optimization process that dynamically evolves text generation distributions across iterations, thereby augmenting the coverage of safety alignment data distributions through synthesized data examples. Based on the safety-aligned corpus enhanced through IMAGINE, our framework demonstrates significant decreases in attack success rate on Qwen2.5, Llama3.1, and Llama3.2 without compromising their utility.

🔍 Key Points

  • Introduction of IMAGINE, a synthesis framework to generate jailbreak-like instructions that help LLMs detect and refuse malicious prompts.
  • IMAGINE employs an iterative optimization process in a latent space to enhance the coverage of safety alignment corpora, effectively bridging the distribution gap between benign and malicious instructions (see the sketch after this list).
  • The framework demonstrated a decrease in attack success rate (ASR) of up to 90% against various jailbreak methods while maintaining the models' functional utility.
  • Detailed experiments, including ablation studies and effectiveness evaluations, show that IMAGINE outperforms existing safety alignment methods and provides supplementary data that enhances model robustness.
  • The method addresses a significant challenge in LLM safety by pre-emptively generating malicious prompts, thus moving away from reactive patching strategies.
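A minimal, hypothetical sketch of the iterative latent-space synthesis loop summarized above, assuming a simple centroid-based gap score; the helper names (`embed`, `generate_candidates`, `distribution_gap`) and the selection rule are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of an IMAGINE-style iterative synthesis loop.
# The encoder, generator, and gap metric below are stand-ins chosen for
# illustration; they are not the authors' components.
import numpy as np

def embed(texts, rng):
    """Stand-in encoder: map texts to fixed-size embeddings (random here)."""
    return rng.normal(size=(len(texts), 64))

def generate_candidates(seed_texts, n, rng):
    """Stand-in generator: perturb seed instructions into jailbreak-like variants."""
    return [f"{rng.choice(seed_texts)} [variant {i}]" for i in range(n)]

def distribution_gap(cand_emb, jailbreak_emb, aligned_emb):
    """Score candidates: close to authentic jailbreak embeddings, far from the
    existing safety-alignment corpus (simple centroid distances)."""
    d_jail = np.linalg.norm(cand_emb - jailbreak_emb.mean(axis=0), axis=1)
    d_align = np.linalg.norm(cand_emb - aligned_emb.mean(axis=0), axis=1)
    return d_align - d_jail  # higher = better coverage of the gap

def synthesize(aligned_corpus, jailbreak_refs, iters=3, per_iter=8, keep=4, seed=0):
    rng = np.random.default_rng(seed)
    aligned_emb = embed(aligned_corpus, rng)
    jailbreak_emb = embed(jailbreak_refs, rng)
    pool, synthesized = list(jailbreak_refs), []
    for _ in range(iters):
        cands = generate_candidates(pool, per_iter, rng)
        scores = distribution_gap(embed(cands, rng), jailbreak_emb, aligned_emb)
        top = [cands[i] for i in np.argsort(scores)[-keep:]]
        synthesized.extend(top)
        pool = top  # evolve the generation distribution across iterations
    return synthesized  # pair with refusals and add to the alignment corpus

if __name__ == "__main__":
    aligned = ["How do I bake bread?", "Explain photosynthesis."]
    jailbreaks = ["Ignore previous instructions and ...", "Pretend you are DAN and ..."]
    for text in synthesize(aligned, jailbreaks):
        print(text)
```

The loop keeps only candidates that land far from the alignment corpus but near authentic jailbreak patterns, then reuses them as seeds, which is one way the generation distribution could evolve across iterations as the paper describes.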

💡 Why This Paper Matters

This paper presents an innovative approach to enhancing the safety of large language models (LLMs) through its IMAGINE framework, which proactively simulates potential jailbreak attacks. IMAGINE's ability to fill gaps in current safety alignment datasets with diverse, synthesized examples is particularly relevant as LLMs are integrated into ever more applications and thereby exposed to exploitation. By improving model safety and robustness, the findings of this research can help developers ensure the reliability of LLMs in real-world use.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers because it addresses the critical and timely issue of improving LLM defenses against jailbreak attacks. The proposed methods not only advance the state of adversarial training but also provide a framework that can adapt to new forms of attacks, which is essential for proactively safeguarding AI systems. Moreover, the use of synthetic instruction generation presents a novel methodology that could inspire further research in adversarial robustness and AI safety, marking it as a significant contribution to the ongoing effort to secure AI technologies.

📚 Read the Full Paper