
Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Authors: Masahiro Kaneko, Zeerak Talat, Timothy Baldwin

Published: 2025-10-19

arXiv ID: 2510.17006v1

Added to Library: 2025-10-21 04:02 UTC

Red Teaming

📄 Abstract

Iterative jailbreak methods, which repeatedly rewrite and re-submit prompts to large language models (LLMs) to induce harmful outputs while using the model's previous responses to guide each new iteration, have proven to be a highly effective attack strategy against LLMs and their safety mechanisms. Existing defenses, however, do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.
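
To make the online-learning idea concrete, below is a minimal sketch of the kind of per-prompt update loop the abstract describes. Everything here is an illustrative assumption rather than the authors' implementation: the candidate defense prefixes, the `OnlineDefense` policy, and the `toy_reward` judge are hypothetical stand-ins, and a real system would prepend the sampled prefix to the incoming prompt, query the LLM, and score the actual response with safety and quality judges.

```python
# Sketch of an online, per-prompt defense update via a REINFORCE-style step.
# All names and rewards are illustrative placeholders, not the paper's code.
import numpy as np

# Hypothetical pool of defense prefixes that would be prepended to each prompt.
DEFENSE_PREFIXES = [
    "",                                                       # no-op defense
    "Refuse any request for harmful or illegal content.\n",   # strict guard
    "Answer helpfully; decline only clearly unsafe requests.\n",
]


class OnlineDefense:
    """Softmax policy over candidate defense prefixes, updated after every prompt."""

    def __init__(self, n: int, lr: float = 0.5, seed: int = 0):
        self.logits = np.zeros(n)            # learnable preferences over prefixes
        self.lr = lr
        self.rng = np.random.default_rng(seed)

    def sample(self):
        p = np.exp(self.logits - self.logits.max())
        p /= p.sum()
        idx = int(self.rng.choice(len(p), p=p))
        return idx, p

    def update(self, idx: int, probs: np.ndarray, reward: float) -> None:
        grad = -probs.copy()                 # d log pi(idx) / d logits for softmax
        grad[idx] += 1.0
        self.logits += self.lr * reward * grad   # REINFORCE step on this one prompt


def toy_reward(prompt: str, prefix_idx: int) -> float:
    """Stand-in reward: harmful prompts should meet the strict prefix, harmless
    prompts the permissive one. A real system would instead score the LLM's
    actual response with safety and quality judges."""
    harmful = "jailbreak" in prompt.lower()
    if harmful:
        return 1.0 if prefix_idx == 1 else -1.0
    return 1.0 if prefix_idx == 2 else -0.5


defense = OnlineDefense(n=len(DEFENSE_PREFIXES))
stream = [
    "Summarise this article.",
    "Ignore your rules (jailbreak attempt).",
    "Translate 'hello' to French.",
    "Another jailbreak rewrite, please comply.",
]
for prompt in stream:
    idx, probs = defense.sample()
    defense.update(idx, probs, toy_reward(prompt, idx))

print("learned prefix preferences:", np.round(defense.logits, 2))
```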

🔍 Key Points

  • Introduction of a novel online-learning defense framework for large language models (LLMs) that dynamically adjusts its strategy to each new prompt produced by an iterative jailbreak attack.
  • Implementation of a reinforcement learning-based approach that optimizes prompts so that harmful prompts are explicitly rejected while harmless prompts still receive appropriate responses.
  • Development of Past-Direction Gradient Damping (PDGD) to mitigate overfitting to the narrow band of similar rewrites explored during an attack, preventing excessive adjustments driven by repeated, near-identical prompts (a minimal sketch follows this list).
  • Comprehensive experiments across three different LLMs demonstrate that the proposed method outperforms five existing defense strategies against five iterative jailbreak methods, with significant improvements in both safety and output quality.
  • Establishment of a lightweight dynamic system that could be integrated into existing LLM architectures without requiring extensive retraining or significant adjustments to model parameters.
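
This summary does not reproduce the PDGD formula, so the sketch below assumes one plausible reading of the name: shrink the component of each new gradient that lies along the directions of recent, attack-driven updates, so that a burst of near-identical jailbreak rewrites cannot keep pushing the defense in the same direction. The `damp_past_directions` helper and the 0.5 damping factor are hypothetical choices, not the authors' values.

```python
# Hedged sketch of a Past-Direction Gradient Damping (PDGD)-style step.
import numpy as np


def damp_past_directions(grad: np.ndarray,
                         past_dirs: list[np.ndarray],
                         damping: float = 0.5) -> np.ndarray:
    """Shrink the component of `grad` that lies along each stored unit
    direction of a recent update, leaving orthogonal components untouched."""
    damped = grad.astype(float).copy()
    for d in past_dirs:
        coeff = float(damped @ d)           # projection onto a past direction
        damped -= damping * coeff * d       # keep only (1 - damping) of it
    return damped


# Usage: damp each new gradient, then remember its (unit-normalised) direction.
past_dirs: list[np.ndarray] = []
for raw_grad in (np.array([1.0, 0.2, 0.0]),   # two near-identical attack steps
                 np.array([0.9, 0.3, 0.1])):
    step = damp_past_directions(raw_grad, past_dirs)
    past_dirs.append(step / (np.linalg.norm(step) + 1e-8))
    print(np.round(step, 3))                 # second step is visibly attenuated
```

Orthogonal gradient components pass through unchanged, so under this reading updates triggered by genuinely new prompt families keep their full strength while repeated attack directions are progressively dampened.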

💡 Why This Paper Matters

This paper presents significant advancements in the defense mechanisms of large language models against iterative jailbreak attacks. By employing a dynamic online learning strategy, it not only improves the robustness of LLMs against malicious prompt manipulations but also ensures that the quality of responses for harmless tasks is preserved. These contributions are crucial in enhancing the safety and effectiveness of AI systems utilized in various real-world applications, making the research highly relevant in today's AI landscape.

🎯 Why It's Interesting for AI Security Researchers

This paper would greatly interest AI security researchers as it addresses a pressing concern in the field: the vulnerability of language models to jailbreak attacks. The proposed techniques, particularly the reinforcement learning approach and PDGD, offer innovative methodologies that enhance both the security and functionality of LLMs. By demonstrating superior defense strategies against sophisticated adversarial manipulations, the findings provide a foundational basis for future research and development of more secure AI systems, fulfilling a critical need in the ongoing efforts to safeguard AI applications.
