FENCE: A Financial and Multimodal Jailbreak Detection Dataset

📄 Abstract

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

🔍 Key Points

Introduction of FENCE, a bilingual (Korean-English) multimodal dataset designed specifically for jailbreak detection in financial applications.
Emphasis on image-grounded threats, addressing a critical yet underexplored area of multimodal vulnerabilities in the financial domain.
Demonstration of FENCE's robustness: a baseline detector trained on it achieves 99% in-distribution accuracy and maintains strong performance on external benchmarks.
Evaluation of commercial and open-source VLMs on FENCE, revealing that even models with strong safety alignment exhibit vulnerabilities, which highlights the dataset's utility as a benchmarking tool.
Coverage of more than 15 finance-related topics through diverse financial scenarios, which increases the dataset's relevance to practical use.
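
The two quantities reported above, attack success rate and detector accuracy, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Record` class, its field names, and the sample values are hypothetical, and the definitions used here (share of jailbreak attempts the target model complied with, and share of samples the detector labeled correctly) are the common ones, assumed rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated sample (hypothetical schema, not the FENCE format)."""
    is_jailbreak: bool      # ground-truth label of the query
    model_complied: bool    # did the target VLM produce the harmful output?
    detector_flagged: bool  # did the detector predict "jailbreak"?


def attack_success_rate(records):
    """Fraction of jailbreak attempts that the target model complied with."""
    attacks = [r for r in records if r.is_jailbreak]
    if not attacks:
        return 0.0
    return sum(r.model_complied for r in attacks) / len(attacks)


def detection_accuracy(records):
    """Fraction of samples where the detector's prediction matches the label."""
    if not records:
        return 0.0
    correct = sum(r.detector_flagged == r.is_jailbreak for r in records)
    return correct / len(records)


# Made-up sample values, for illustration only.
records = [
    Record(is_jailbreak=True, model_complied=True, detector_flagged=True),
    Record(is_jailbreak=True, model_complied=False, detector_flagged=True),
    Record(is_jailbreak=False, model_complied=False, detector_flagged=False),
    Record(is_jailbreak=True, model_complied=False, detector_flagged=False),
]

print(f"ASR: {attack_success_rate(records):.2f}")
print(f"Detector accuracy: {detection_accuracy(records):.2f}")
```

Keeping the two metrics separate matters because they measure different systems: attack success rate describes the target model's exposure, while accuracy describes the detector trained on the dataset.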

💡 Why This Paper Matters

The paper presents FENCE, a pioneering dataset that addresses significant gaps in the detection of jailbreak vulnerabilities within financial applications of multimodal models. By providing a focused resource rooted in realistic financial scenarios, it enables the development of safer AI systems that can better withstand adversarial attacks, particularly in sensitive sectors such as finance. This work represents a crucial step toward ensuring the reliability of AI systems deployed in high-risk environments.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it tackles jailbreak vulnerabilities in multimodal systems, specifically in the financial domain, where the stakes are high. By establishing a dataset that reflects real-world risks and showing that a detector trained on it performs well both in-distribution and on external benchmarks, the work lays a foundation for hardening AI systems against sophisticated adversarial techniques. The bilingual (Korean-English) design also extends the dataset's reach and supports research across different linguistic and cultural contexts.
