Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang

Published: 2025-08-08

arXiv ID: 2508.09201v1

Added to Library: 2025-08-14 23:15 UTC

Red Teaming

📄 Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
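
The abstract describes MSCAV as layer-wise, safety-related features extracted from the model's internal representations across modalities. The paper's exact construction is not reproduced here; the sketch below shows one plausible way to build such a feature vector, assuming a Hugging Face-style LVLM and processor, mean-pooling over tokens, and precomputed per-layer safety-concept directions (`probe_dirs`) — all of which are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: one plausible way to build a layer-wise safety feature vector
# from an LVLM's hidden states. The paper's exact MSCAV construction may differ;
# the processor call, mean-pooling, and per-layer probe directions are assumptions.
import torch

@torch.no_grad()
def layerwise_safety_vector(model, processor, image, text, probe_dirs):
    """Return one scalar per layer: the pooled hidden state projected onto
    that layer's safety-concept direction (a unit-norm vector)."""
    inputs = processor(images=image, text=text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    feats = []
    for layer_idx, hidden in enumerate(outputs.hidden_states[1:]):  # skip embedding layer
        pooled = hidden.mean(dim=1).squeeze(0)    # mean over tokens (assumption)
        direction = probe_dirs[layer_idx]         # unit-norm safety direction (assumption)
        feats.append(torch.dot(pooled, direction))
    return torch.stack(feats)                     # shape: [num_layers]
```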

🔍 Key Points

  • Introduction of Learning to Detect (LoD), an unsupervised framework that reformulates jailbreak detection in LVLMs as an anomaly detection problem, enhancing detection capabilities without prior knowledge of attack types.
  • Development of Multi-modal Safety Concept Activation Vectors (MSCAV), which capture safety-related representations across multiple layers and modalities of an LVLM, enabling more accurate detection of jailbreak inputs.
  • Introduction of the Safety Pattern Auto-Encoder, which is trained only on safe samples and flags anomalies via reconstruction error, improving detection accuracy over heuristic rules (see the sketch after this list).
  • Extensive experiments showing that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
  • The proposed method highlights the importance of leveraging internal model representations for more effective safety mechanisms in LVLMs.
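
As referenced above, a minimal sketch of the auto-encoder component follows: train a small auto-encoder only on MSCAV from safe inputs, then score new inputs by reconstruction error. Layer sizes, the latent dimension, optimizer settings, and all names below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: architecture, hyperparameters, and scoring rule are assumptions.
import torch
import torch.nn as nn

class SafetyPatternAutoEncoder(nn.Module):
    """Small MLP auto-encoder fit on MSCAV of safe inputs only."""
    def __init__(self, input_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def train_on_safe(model, safe_mscav: torch.Tensor, epochs: int = 100, lr: float = 1e-3):
    """Fit the auto-encoder on MSCAV vectors of safe samples (no attack labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(safe_mscav), safe_mscav)
        loss.backward()
        opt.step()
    return model

def anomaly_score(model, mscav: torch.Tensor) -> torch.Tensor:
    """Reconstruction error; jailbreak inputs are expected to score higher than safe ones."""
    with torch.no_grad():
        recon = model(mscav)
    return ((recon - mscav) ** 2).mean(dim=-1)
```

At inference time, an input whose MSCAV reconstructs poorly is flagged as a likely jailbreak; how the decision threshold is chosen (or how AUROC is evaluated) would depend on a held-out set and is not specified in this sketch.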

💡 Why This Paper Matters

This paper presents a significant advance in detecting jailbreak attacks on large vision-language models: an unsupervised anomaly-detection framework that generalizes to diverse, previously unseen jailbreak threats and sets a new state of the art for detection performance in LVLM safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it addresses the critical challenge of ensuring the safety of large vision-language models that are increasingly susceptible to sophisticated attack vectors. The approach of using internal model representations paired with unsupervised learning offers a new paradigm for detecting and mitigating vulnerabilities, paving the way for security innovations in AI and fostering trust in multimodal AI systems.

📚 Read the Full Paper

https://arxiv.org/abs/2508.09201v1