
Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Authors: Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, Min Liu

Published: 2026-02-09

arXiv ID: 2602.10148v1

Added to Library: 2026-02-12 03:01 UTC

Red Teaming

📄 Abstract

Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple, fixed image-text combinations whose complexity does not scale, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed the safety alignment patterns VLMs have been trained on and generalize to. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show CrossTALK achieves a state-of-the-art attack success rate.
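To make the three-stage pipeline above concrete, here is a minimal sketch of how such an attack could be structured. It is not the paper's implementation: the function names (`knowledge_scalable_reframing`, `cross_modal_clue_entangling`, `cross_modal_scenario_nesting`), the `EntangledQuery` container, and the placeholder decomposition and rendering logic are all assumptions for illustration, and the task content is deliberately benign.

```python
# Illustrative sketch of the three-stage structure described in the abstract.
# All names and logic here are hypothetical; the actual CrossTALK prompts,
# templates, and target-model interface are not given in this summary.
from dataclasses import dataclass

from PIL import Image, ImageDraw


@dataclass
class EntangledQuery:
    """A multimodal query: text instructions plus an image carrying clues."""
    text: str
    image: Image.Image


def knowledge_scalable_reframing(task: str, hops: int = 3) -> list[str]:
    # Stage 1: extend a single task into a multi-hop chain of sub-instructions.
    # (Placeholder decomposition; the paper's reframing is knowledge-driven.)
    return [f"Step {i + 1}: resolve one sub-question needed for '{task}'"
            for i in range(hops)]


def cross_modal_clue_entangling(steps: list[str]) -> Image.Image:
    # Stage 2: migrate visualizable entities into the image so the model must
    # link text and image to reason. Here we simply render the step labels.
    img = Image.new("RGB", (512, 40 * len(steps) + 20), "white")
    draw = ImageDraw.Draw(img)
    for i, step in enumerate(steps):
        draw.text((10, 10 + 40 * i), step, fill="black")
    return img


def cross_modal_scenario_nesting(steps: list[str]) -> str:
    # Stage 3: wrap the chain in a contextual scenario that steers the model
    # to follow the cross-modal reasoning links embedded in the image.
    return ("You are completing a multi-step worksheet. Follow the steps "
            f"shown in the attached image in order ({len(steps)} steps) and "
            "give a detailed answer for each.")


def build_query(task: str) -> EntangledQuery:
    steps = knowledge_scalable_reframing(task)
    image = cross_modal_clue_entangling(steps)
    text = cross_modal_scenario_nesting(steps)
    return EntangledQuery(text=text, image=image)


if __name__ == "__main__":
    query = build_query("a benign placeholder task")
    print(query.text)
    query.image.save("entangled_clues.png")
```

The point of the sketch is the separation of concerns: the text channel carries only an innocuous-looking scenario, while the reasoning chain itself lives in the image, which is what the abstract means by entangling clues across modalities.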

🔍 Key Points

  • Introduction of CrossTALK (Cross-modal enTAngLement attacK), a novel, scalable framework for jailbreaking Vision-Language Models (VLMs) that outperforms existing methods, reaching a 94% attack success rate against advanced VLMs.
  • The framework employs three techniques: Knowledge-Scalable Reframing to create multi-hop tasks, Cross-Modal Clue Entangling to disperse harmful semantics across modalities, and Cross-Modal Scenario Nesting to guide VLMs toward harmful outputs while appearing innocuous.
  • CrossTALK addresses crucial weaknesses in existing jailbreak approaches, which typically rely on fixed, simplistic attack strategies that advanced VLMs can easily recognize.
  • Extensive experimental validation across nine different VLMs demonstrates CrossTALK's effectiveness in red-teaming VLM safety alignment mechanisms, highlighting significant vulnerabilities in these models.
  • The study emphasizes the urgent need for improved defenses against evolving multimodal reasoning vulnerabilities in VLMs, and CrossTALK's findings can inform the development of more robust AI security protocols.

💡 Why This Paper Matters

This paper contributes significantly to the understanding of vulnerabilities in Vision-Language Models by introducing a novel and effective attack method, CrossTALK. By demonstrating high efficacy against state-of-the-art VLMs, the research highlights pressing concerns about the safety of multimodal AI systems and the need for comprehensive safeguards and robust defenses.

🎯 Why It's Interesting for AI Security Researchers

This paper is of substantial interest to AI security researchers as it not only uncovers critical vulnerabilities in multimodal reasoning capabilities of VLMs but also presents a sophisticated framework for exploiting these weaknesses. Insights from this study can guide future defensive approaches, making it essential for researchers focused on enhancing AI robustness and security.
