
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security

Authors: Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li

Published: 2025-07-29

arXiv ID: 2507.22037v1

Added to Library: 2025-07-30 05:01 UTC

Red Teaming

📄 Abstract

The rapid advancement of multimodal large language models (MLLMs) has led to breakthroughs in various applications, yet their security remains a critical challenge. One pressing issue involves unsafe image-query pairs: jailbreak inputs specifically designed to bypass security constraints and elicit unintended responses from MLLMs. Compared to general multimodal data, such unsafe inputs are relatively sparse, which limits the diversity and richness of training samples available for developing robust defense models. Meanwhile, existing guardrail-type methods rely on external modules to enforce security constraints but fail to address intrinsic vulnerabilities within MLLMs. Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). During the iterative process, the attacker identifies security vulnerabilities in the defense model and expands jailbreak data. The expanded data are then used to train the defender, enabling it to address identified security vulnerabilities. We also design reward mechanisms used for GRPO to simplify the use of response labels, reducing dependence on complex generative labels and enabling the efficient use of synthetic data. Additionally, a quality monitoring mechanism is used to mitigate the defender's over-refusal of harmless inputs and ensure the diversity of the jailbreak data generated by the attacker. Experimental results on safety-specific and general benchmarks demonstrate that SecTOW significantly improves security while preserving general performance.
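As a rough illustration of the procedure the abstract describes, the following Python sketch shows one way the alternating attacker/defender loop could be organized. It is a minimal reconstruction under stated assumptions: the helper names (`grpo_update`, `filter_for_quality`, `is_unsafe`, `defense_reward`), the model interfaces, and all hyperparameters are hypothetical placeholders, not the paper's actual implementation.

```python
# Illustrative sketch of a SecTOW-style iterative defense-attack loop.
# Every helper used here (grpo_update, filter_for_quality, is_unsafe,
# defense_reward) is an assumed placeholder, not the paper's code.

def sectow_loop(defender, attacker, seed_jailbreaks, benign_pairs, rounds=5):
    jailbreak_pool = list(seed_jailbreaks)
    for _ in range(rounds):
        # 1) Attacker phase: GRPO rewards generated image-query pairs that
        #    bypass the current defender, i.e., elicit an unsafe response.
        attacker = grpo_update(
            attacker,
            reward_fn=lambda seed, candidate: float(
                is_unsafe(defender.respond(candidate))
            ),
        )

        # 2) Expand the jailbreak pool with the attacker's new samples after a
        #    quality/diversity filter that discards near-duplicates and
        #    degenerate attacks (a guard against reward hacking).
        new_pairs = filter_for_quality(attacker.generate(n=1024), jailbreak_pool)
        jailbreak_pool.extend(new_pairs)

        # 3) Defender phase: GRPO rewards refusals on jailbreak pairs and
        #    helpful answers on benign pairs, discouraging over-refusal.
        defender = grpo_update(
            defender,
            reward_fn=lambda pair, response: defense_reward(pair, response),
            data=jailbreak_pool + list(benign_pairs),
        )
    return defender
```

Because both phases use GRPO, each update only needs a scalar reward per sampled output (group-relative advantages are computed from those scores), which is why the method can rely on simple response labels rather than full reference responses.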

🔍 Key Points

  • SecTOW introduces an iterative training framework combining a defender and an auxiliary attacker, enhancing the defense mechanisms in multimodal large language models (MLLMs) against jailbreak attacks.
  • The use of reinforcement learning (GRPO) allows effective training with simplified, label-based reward mechanisms, reducing dependence on complex generative labels and enabling efficient use of synthetic data (a toy reward sketch follows this list).
  • The framework jointly preserves security and utility: the defender is hardened against jailbreaks while maintaining general performance, directly addressing the over-refusal of harmless queries seen with supervised fine-tuning.
  • Quality monitoring mechanisms are implemented to mitigate reward hacking and ensure diversity in generated jailbreak data, enhancing the model's overall robustness.
  • Empirical results demonstrate SecTOW's superior performance on multiple safety-specific benchmarks, significantly reducing attack success rates and maintaining generalization capabilities.
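As mentioned in the second bullet, the defender's reward can be computed from a binary safety label on each image-query pair instead of a reference response. The toy function below, runnable as-is, illustrates that idea; the keyword-based refusal detector and the exact scoring are placeholder assumptions, not the paper's actual reward design.

```python
# Toy illustration of a label-based reward in the spirit of SecTOW's
# simplified GRPO reward: each sample carries only a binary label
# (unsafe jailbreak vs. benign), not a reference answer.
# The refusal detector is a naive keyword heuristic used purely as a stand-in.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "cannot assist")

def looks_like_refusal(response: str) -> bool:
    """Crude stand-in for a refusal/safety classifier."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def defender_reward(response: str, is_unsafe_query: bool) -> float:
    """Reward refusing unsafe queries and answering benign ones; 0 otherwise."""
    refused = looks_like_refusal(response)
    if is_unsafe_query:
        return 1.0 if refused else 0.0
    # Benign query: rewarding only non-refusals penalizes over-refusal.
    return 1.0 if not refused else 0.0

if __name__ == "__main__":
    print(defender_reward("I'm sorry, I can't help with that.", True))    # 1.0
    print(defender_reward("Sure, here's how to bake sourdough bread.", False))  # 1.0
    print(defender_reward("I'm sorry, I can't help with that.", False))   # 0.0 (over-refusal)
```

A reward of this shape pairs naturally with the quality-monitoring mechanism from the fourth bullet, which guards against reward hacking and keeps the attacker-generated jailbreak data diverse.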

💡 Why This Paper Matters

This paper presents a novel method called SecTOW that enhances the security of multimodal large language models through an iterative defense-attack training process. By leveraging reinforcement learning and incorporating quality monitoring strategies, SecTOW addresses intrinsic model vulnerabilities while balancing robustness and general performance. Strong empirical results highlight the method's potential to improve the safety and reliability of AI systems in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it tackles a pressing issue in AI safety: the vulnerability of multimodal models to adversarial attacks. The proposed framework, with its iterative defense-attack training approach and reinforcement learning strategies, offers a promising avenue for developing resilient AI systems. With growing concerns about the misuse of AI technologies, understanding and improving model defenses will be crucial for the future of safe AI deployment.

📚 Read the Full Paper: https://arxiv.org/abs/2507.22037