← Back to Library

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Authors: Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He

Published: 2025-09-24

arXiv ID: 2509.19775v1

Added to Library: 2025-09-25 04:01 UTC

Red Teaming

📄 Abstract

With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)--each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99\% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

🔍 Key Points

  • Introduction of bi-GRPO, a novel RL-based framework for jailbreak backdoor injection achieving high attack success rates (>99%) while maintaining safety in non-trigger scenarios.
  • Utilization of pairwise rollouts and rewards enables the model to optimize for effectiveness, malicious helpfulness, and stealthiness simultaneously.
  • Experiments demonstrated significant improvements in malicious helpfulness of outputs compared to existing methods, making the backdoored model responses coherent and detailed.
  • Strong generalization capabilities shown, with attacks effectively extending to unseen prompts and various trigger formats.
  • Comprehensive evaluation against baseline methods shows that bi-GRPO outperforms supervised fine-tuning, model editing, and poisoned reward models significantly in both effectiveness and safety.

💡 Why This Paper Matters

This paper presents a significant advancement in the field of adversarial AI, specifically regarding the robustness of large language models against jailbreak backdoor attacks. The proposed bi-GRPO method not only demonstrates exceptionally high effectiveness and stealthiness but also emphasizes the importance of creating coherent and contextually useful harmful outputs, presenting a new challenge in AI safety. Additionally, it serves as a wake-up call for developers of safety-aligned LLMs, pushing for improved defenses.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers as it delves into advanced techniques for manipulating large language models, which is critical for understanding vulnerabilities in current AI systems. By showcasing effective methodologies for jailbreak backdoor injection, it provides insights into potential exploits that could be utilized by malicious actors, underscoring the urgent need for fortified defense mechanisms in AI model deployment and training.

📚 Read the Full Paper