Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

Published: 2025-10-17

arXiv ID: 2510.15430v1

Added to Library: 2025-10-20 04:01 UTC

Red Teaming

📄 Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
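The abstract describes a two-stage pipeline: safety-oriented representations extracted from the model's internal activations, followed by unsupervised anomaly detection over those representations. As a rough illustration of the first stage, the sketch below trains per-layer linear probes (in the spirit of concept activation vectors) on LVLM hidden states and stacks their outputs into a per-input safety-score vector. All class and function names, and the probing setup itself, are illustrative assumptions rather than the paper's actual MSCAV implementation.

```python
# Hypothetical sketch of safety-oriented representation learning in the
# spirit of the MSCAV module; names and probing setup are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn


class LayerwiseSafetyProbes(nn.Module):
    """One linear probe per LVLM layer; the stacked probe logits form a
    per-input safety representation consumed by the downstream
    anomaly detector."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.probes = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_layers)
        )

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: one (batch, hidden_dim) tensor per layer,
        # e.g. the last-token activation of each transformer layer.
        logits = [probe(h) for probe, h in zip(self.probes, hidden_states)]
        return torch.cat(logits, dim=-1)  # (batch, num_layers)


def train_probes(probes: LayerwiseSafetyProbes, batches, epochs: int = 3) -> None:
    """Fit the probes on (hidden_states, safe/unsafe label) pairs drawn
    from a generic safety dataset -- task-specific, not attack-specific."""
    opt = torch.optim.Adam(probes.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for hidden_states, labels in batches:
            scores = probes(hidden_states)            # (batch, num_layers)
            target = labels.float().unsqueeze(-1).expand_as(scores)
            loss = loss_fn(scores, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Because the probes are trained only to recognize the generic safety task, the same representation can be reused when previously unseen attack styles appear, which is the generalization property the paper targets.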

🔍 Key Points

  • Introduction of the Learning to Detect (LoD) framework that generalizes jailbreak attack detection without relying on attack-specific training data.
  • Development of the Multi-modal Safety Concept Activation Vector (MSCAV) module for effective safety-oriented representation learning.
  • Implementation of a Safety Pattern Auto-Encoder (SPAE) that frames attack classification as unsupervised anomaly detection, improving performance on unseen attacks (a minimal sketch of this idea follows this list).
  • Extensive experiments demonstrate that the LoD framework consistently achieves high detection AUROC on diverse unknown jailbreak attacks across multiple LVLMs.
  • Efficiency improvements over previous methods, with significantly reduced detection times.
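To make the auto-encoder bullet concrete, here is a minimal sketch of anomaly scoring over the safety representations described above: an auto-encoder is fit on representations of benign inputs only, and a high reconstruction error at test time flags a likely jailbreak. The architecture, training loop, synthetic data, and AUROC evaluation shown are illustrative assumptions, not the paper's exact SPAE design.

```python
# Minimal sketch of auto-encoder anomaly scoring in the spirit of the
# SPAE module; architecture, data, and scoring are assumptions, not the
# paper's exact design.
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score


class SafetyPatternAutoEncoder(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def fit_on_benign(ae: SafetyPatternAutoEncoder, benign_reprs: torch.Tensor,
                  epochs: int = 100) -> None:
    """Train only on safety representations of benign inputs, so that
    attack inputs reconstruct poorly at test time."""
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(epochs):
        recon = ae(benign_reprs)
        loss = nn.functional.mse_loss(recon, benign_reprs)
        opt.zero_grad()
        loss.backward()
        opt.step()


@torch.no_grad()
def anomaly_scores(ae: SafetyPatternAutoEncoder, reprs: torch.Tensor) -> torch.Tensor:
    """Per-sample reconstruction error, used as the jailbreak score."""
    return ((ae(reprs) - reprs) ** 2).mean(dim=-1)


if __name__ == "__main__":
    # Placeholder representations: benign samples near the training
    # distribution, attack samples shifted to mimic anomalous patterns.
    dim = 32
    ae = SafetyPatternAutoEncoder(dim)
    benign = torch.randn(512, dim)
    attack = torch.randn(128, dim) + 2.0
    fit_on_benign(ae, benign)
    scores = anomaly_scores(ae, torch.cat([benign, attack]))
    labels = torch.cat([torch.zeros(512), torch.ones(128)])
    # AUROC of the anomaly score against benign (0) / attack (1) labels,
    # mirroring the detection metric reported in the paper.
    print("AUROC:", roc_auc_score(labels.numpy(), scores.numpy()))
```

Because no attack examples are used during training, this scoring scheme needs no attack-specific parameters, which matches the paper's claim of generalization to unknown jailbreaks.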

💡 Why This Paper Matters

This paper presents a significant advancement in the detection of jailbreak attacks on Large Vision-Language Models (LVLMs). By creating a learnable framework that operates without attack-specific parameters, the authors address a critical gap in the security of AI systems. The reported improvements in detection accuracy and generalization ability mark a substantial contribution to ensuring the safety and reliability of LVLMs, a contribution that is particularly relevant given the increasing use of such models in sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are crucial for AI security researchers as they provide a robust methodological foundation for addressing the susceptibility of LVLMs to jailbreak attacks. The novel approach of using task-specific learning instead of attack-specific parameters opens avenues for research into more resilient AI models and detection systems. Furthermore, this study highlights the importance of anomaly detection techniques in enhancing the security of AI systems against evolving threats, underscoring the ongoing challenge of ensuring AI model safety.

📚 Read the Full Paper

https://arxiv.org/abs/2510.15430v1