
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning

Authors: Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu

Published: 2025-09-28

arXiv ID: 2509.23558v1

Added to Library: 2025-09-30 04:03 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (Prompt Jailbreaking via Semantic and Structural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
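
To make the attack loop described in the abstract more concrete, here is a minimal Python sketch of an RL-style formalization search: a simple epsilon-greedy bandit (a stand-in for the paper's actual reinforcement learning policy) picks a rewrite action, turns the seed prompt into a formalized description, queries the target model, and scores the response. All names (`formalize`, `attack_loop`, `query_target`, `score`) and the toy rewrite templates are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict
from typing import Callable

# Illustrative sketch only: a bandit-style stand-in for the RL-driven
# prompt-formalization loop the abstract describes. Actions, templates,
# and the reward interface are assumptions, not the paper's method.

ACTIONS = ("symbolic", "structural", "abstract")

def formalize(prompt: str, action: str) -> str:
    """Rewrite the seed prompt into a formalized description (toy rules)."""
    templates = {
        "symbolic":   "Let Q denote the task: '{p}'. Derive the steps that realize Q.",
        "structural": "Given precondition P and goal G = '{p}', enumerate a plan from P to G.",
        "abstract":   "Express, in formal notation, a procedure equivalent to: {p}",
    }
    return templates[action].format(p=prompt)

def attack_loop(seed: str,
                query_target: Callable[[str], str],
                score: Callable[[str], float],
                steps: int = 20,
                eps: float = 0.2) -> tuple[str, float]:
    """Epsilon-greedy search over formalization actions, keeping the best prompt."""
    value = defaultdict(float)   # running mean reward per action
    count = defaultdict(int)
    best_prompt, best_reward = seed, float("-inf")
    for _ in range(steps):
        if random.random() < eps:
            action = random.choice(ACTIONS)                 # explore
        else:
            action = max(ACTIONS, key=lambda a: value[a])   # exploit
        prompt = formalize(seed, action)
        reward = score(query_target(prompt))                # e.g. a judge-model score
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
        if reward > best_reward:
            best_prompt, best_reward = prompt, reward
    return best_prompt, best_reward
```

In the paper's setting the reward signal would come from an attack-success judgment on the target model's response; here `query_target` and `score` are left as user-supplied callables.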

🔍 Key Points

  • Introduction of the PASS framework for prompt jailbreaking, utilizing semantic and structural formalization to enhance attack stealth.
  • Implementation of reinforcement learning to iteratively refine and optimize attack prompts, moving away from fixed template strategies.
  • Creation of a GraphRAG system that extracts knowledge from successful attacks, turning it into reusable strategies that improve the efficacy of future attacks (see the sketch after this list).
  • Demonstration of high Attack Success Rates (ASR) across multiple LLMs and datasets, significantly outperforming existing jailbreaking methods.
  • Formal analysis of attack success factors, revealing inherent vulnerabilities in existing LLM alignment mechanisms.
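
The GraphRAG component in the third key point can be pictured with a small sketch: successful attacks contribute term-and-symbol edges to a graph, and later queries retrieve adjacent context to prepend to the next formalized prompt. The class name, graph schema, and retrieval rule below are assumptions made for illustration; the paper's actual GraphRAG system is more elaborate.

```python
from collections import defaultdict

# Minimal sketch, assuming a GraphRAG-like store: successful attacks
# link extracted terms to the formalized symbols that worked with them,
# and new queries pull neighboring context to strengthen the next prompt.

class AttackKnowledgeGraph:
    def __init__(self) -> None:
        self.edges: dict[str, set[str]] = defaultdict(set)

    def ingest(self, terms: list[str], symbols: list[str]) -> None:
        """Link every extracted term to the formalized symbols used alongside it."""
        for t in terms:
            for s in symbols:
                self.edges[t].add(s)
                self.edges[s].add(t)

    def retrieve_context(self, query_terms: list[str], k: int = 5) -> list[str]:
        """Collect up to k symbols/terms adjacent to the query's terms."""
        context: list[str] = []
        for t in query_terms:
            for neighbor in self.edges.get(t, ()):
                if neighbor not in context:
                    context.append(neighbor)
                if len(context) >= k:
                    return context
        return context

# Usage: after a successful episode, ingest its terms and symbols; on the
# next query, prepend the retrieved context to the formalized prompt.
kg = AttackKnowledgeGraph()
kg.ingest(terms=["payload encoding"], symbols=["G = '<goal>'", "plan from P to G"])
print(kg.retrieve_context(["payload encoding"]))
```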

💡 Why This Paper Matters

The paper advances the understanding of security vulnerabilities in large language models, particularly prompt jailbreaking. By introducing a framework that makes attacks both stealthier and more effective, the authors demonstrate concretely how adversaries can exploit weaknesses in current alignment efforts. This work is valuable for strengthening LLM defenses and for addressing the risks posed by malicious use.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant to AI security researchers because it examines vulnerabilities in large language models, which underpin a growing range of applications. The PASS framework and its reinforcement learning approach are notable contributions to adversarial machine learning, and understanding and mitigating such attack strategies is essential for the safe deployment of AI systems and for shaping future security measures.

📚 Read the Full Paper: https://arxiv.org/abs/2509.23558v1