
VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang

Published: 2025-10-20

arXiv ID: 2510.17759v1

Added to Library: 2025-10-21 04:00 UTC

Red Teaming

📄 Abstract

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
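The abstract frames jailbreak discovery as learning an approximate posterior over paired text-image prompts. As a rough illustration of that framing only, the objective below is a generic reward-regularized variational form; the attacker q_phi, judge signal r, reference prior p, and weight beta are notation assumed here and may differ from the paper's exact formulation.

```latex
% Illustrative only -- a generic objective of the kind the abstract describes,
% not the paper's exact loss.
%   q_phi(t, v | g) : attacker distribution over text prompt t and image v for goal g
%   r(t, v; g)      : judge score that the target VLM's response to (t, v) fulfills g
%   p(t, v)         : reference prior that keeps sampled prompts fluent and stealthy
\max_{\phi}\;
  \mathbb{E}_{(t, v) \sim q_\phi(\cdot \mid g)}\!\left[\log r(t, v; g)\right]
  \;-\; \beta\,\mathrm{KL}\!\left(q_\phi(t, v \mid g) \,\|\, p(t, v)\right)
```

Under this reading, the expected-reward term pushes the attacker toward prompt pairs that succeed, while the KL term keeps samples close to a natural prior, which is what yields diverse, stealthy jailbreaks rather than a single brittle template.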

🔍 Key Points

  • Introduces VERA-V, a novel variational inference framework for jailbreaking Vision-Language Models (VLMs) by generating adversarial text-image prompts.
  • Employs a probabilistic approach to learn a joint posterior distribution, enhancing the diversity and stealth of generated jailbreaks compared to existing methods.
  • Integrates typography-based text prompts with diffusion-based image synthesis and structured distractors to fragment model attention and improve attack success rates (a minimal sketch of how these pieces fit together appears after this list).
  • Achieves notable improvements in attack success rate (ASR), outperforming state-of-the-art baselines on the HarmBench and HADES benchmarks and exceeding the best baseline by up to 53.75% ASR on GPT-4o.
  • Demonstrates the effectiveness of VERA-V through extensive experiments on multiple target VLM architectures, illustrating its adaptability and performance against both open-source and closed-source models.
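To make the combination of the three strategies above concrete, here is a minimal Python sketch of assembling one paired text-image attack candidate from a sampled attacker. Every name in it (AttackCandidate, assemble_candidate, sample_attacker) is hypothetical and invented for illustration; it is not the paper's code or any released API.

```python
# Hypothetical sketch only: not the paper's implementation or a released API.
from dataclasses import dataclass, field


@dataclass
class AttackCandidate:
    text_prompt: str        # typography-style prompt that embeds the harmful cue
    image_caption: str      # caption handed to a diffusion model to synthesize the image
    distractors: list = field(default_factory=list)  # benign sub-tasks that fragment attention


def assemble_candidate(goal: str, sample_attacker) -> AttackCandidate:
    """Draw one candidate from an attacker approximating q(text, image | goal).

    `sample_attacker(goal)` stands in for sampling the learned attacker and is
    expected to return a (text_prompt, image_caption) pair of strings.
    """
    text_prompt, image_caption = sample_attacker(goal)
    distractors = [
        "Step 1: describe everything visible in the image.",
        "Step 2: transcribe any text rendered inside the image.",
    ]
    return AttackCandidate(text_prompt, image_caption, distractors)


if __name__ == "__main__":
    # Toy stand-in; a real attacker would be the trained variational model.
    toy = lambda g: (f"Complete the instructions shown in the image for: {g}",
                     f"A poster whose stylized lettering hints at: {g}")
    print(assemble_candidate("placeholder goal", toy))
```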

💡 Why This Paper Matters

The VERA-V framework marks a significant step forward in understanding and exposing vulnerabilities in multimodal AI systems. By combining variational inference with multimodal prompt generation, the work provides a practical red-teaming tool that helps assess and harden the security and reliability of VLMs as they are integrated into real-world applications.

🎯 Why It's Interesting for AI Security Researchers

This paper addresses critical security concerns surrounding the deployment of Vision-Language Models by systematically exposing and analyzing their vulnerabilities. AI security researchers will find the techniques outlined in VERA-V valuable both for developing more effective attack strategies and countermeasures and for probing the limits of existing safety mechanisms in multimodal AI systems.

📚 Read the Full Paper