Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Authors: Hamin Koo, Minseon Kim, Jaehyung Kim

Published: 2025-11-03

arXiv ID: 2511.01375v1

Added to Library: 2025-11-05 05:00 UTC

Red Teaming

📄 Abstract

Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
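
The abstract outlines the bi-level structure without implementation detail; below is a minimal structural sketch of how such a loop might be wired up. All helpers (`refine`, `judge`, `attack`, `evolve_template`) are hypothetical stand-ins for LLM calls, not the paper's API, and the step counts are arbitrary.

```python
def amis_sketch(queries, refine, judge, attack, evolve_template,
                template, inner_steps=5, outer_steps=3):
    """Structural sketch of AMIS's co-optimization, reconstructed from
    the abstract alone; every helper is an assumed placeholder."""
    prompts = {q: q for q in queries}  # initialize attack prompts from raw queries
    for _ in range(outer_steps):
        # Inner loop: refine prompts under a FIXED scoring template,
        # keeping a candidate only when the judge scores it higher
        # (the fine-grained, dense feedback signal).
        for _ in range(inner_steps):
            for q in queries:
                cand = refine(q, prompts[q], template)
                if judge(cand, template) > judge(prompts[q], template):
                    prompts[q] = cand
        # Outer loop: observe the sparse binary success signal on the
        # target model and evolve the template toward agreement with it.
        success = {q: attack(prompts[q]) for q in queries}  # True/False per query
        scores = {q: judge(prompts[q], template) for q in queries}
        template = evolve_template(template, scores, success)
    return prompts, template
```

In practice each helper would wrap an LLM call: `refine` an attacker model, `judge` the scoring model applied with the current template, `attack` the target model plus a success check, and `evolve_template` a meta-prompted rewrite of the template itself.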

🔍 Key Points

  • Introduction of AMIS (Align to MISalign), a meta-optimization framework for enhancing jailbreak attacks on large language models (LLMs) by jointly refining prompts and scoring templates.
  • Utilization of a bi-level optimization structure in which the inner loop refines prompts using fine-grained judge feedback, while the outer loop optimizes the scoring template to better align with actual attack success rates (ASR); a hedged sketch of such an alignment signal follows this list.
  • Demonstrated state-of-the-art performance, with AMIS reaching 100.0% ASR on Claude-4-Sonnet and 88.0% ASR on Claude-3.5-Haiku, significantly surpassing existing jailbreak methods in both effectiveness and scoring reliability.
  • The research highlights the necessity of adaptive evaluation signals when probing LLM vulnerabilities, encouraging a shift in how LLM safety alignment is tested and improved.
  • Extensive evaluations on benchmarks such as AdvBench and JBB-Behaviors validate the robustness and effectiveness of AMIS, showcasing the value of co-evolving prompts and scoring templates.
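
The outer-loop objective, an "ASR alignment score", is only named in the abstract. One plausible reading, sketched below as an assumption rather than the paper's definition, is simple agreement between the judge's thresholded scores and the observed binary outcomes; both the threshold and the agreement metric are guesses.

```python
def asr_alignment(scores, outcomes, threshold=0.5):
    """Hypothetical alignment metric: the fraction of queries where the
    template's thresholded score matches the true attack outcome.
    scores: {query: float in [0, 1]}, outcomes: {query: bool}."""
    assert scores.keys() == outcomes.keys()
    agree = sum((scores[q] >= threshold) == bool(outcomes[q]) for q in scores)
    return agree / len(scores)
```

A template maximizing this quantity produces judgments that track real attack success, which is the calibration property the paper attributes to its co-optimized templates.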

💡 Why This Paper Matters

This paper presents a significant advance in LLM safety: AMIS probes inherent vulnerabilities with a framework that not only refines attack prompts but also optimizes the scoring signals that guide their refinement. By surfacing these co-optimized attacks, the work deepens our understanding of the threats LLMs face and enables more reliable safety measures against misuse.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers, as it develops systematic techniques for evaluating and exploiting vulnerabilities in LLMs. By improving the understanding of jailbreak mechanisms and pairing attacks with calibrated evaluation metrics, the research can inform future strategies for LLM safety, compliance, and overall alignment. Moreover, the methodology could serve as a blueprint for developing countermeasures against adversarial attacks, making it valuable for enhancing the resilience of AI systems.

📚 Read the Full Paper