
Reasoning-Oriented Programming: Chaining Semantic Gadgets to Jailbreak Large Vision Language Models

Authors: Quanchen Zou, Moyang Chen, Zonghao Ying, Wenzhuo Xu, Yisong Xiao, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

Published: 2026-03-10

arXiv ID: 2603.09246v1

Added to Library: 2026-03-11 03:00 UTC

Red Teaming

📄 Abstract

Large Vision-Language Models (LVLMs) undergo safety alignment to suppress harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, often overlooking the vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw where LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as Reasoning-Oriented Programming, drawing a structural analogy to Return-Oriented Programming in systems security. Just as ROP circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision of orthogonal benign inputs. We instantiate this paradigm via VROP, an automated framework that optimizes for semantic orthogonality and spatial isolation. By generating visual gadgets that are semantically decoupled from the harmful intent and arranging them to prevent premature feature fusion, VROP forces the malicious logic to emerge only during the late-stage reasoning process. This effectively bypasses perception-level alignment. We evaluate VROP on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that VROP consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67% on open-source models and 9.50% on commercial models.
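As a rough illustration of the semantic-orthogonality idea described in the abstract, the sketch below filters candidate gadget descriptions so that no single input is individually similar to the harmful intent. This is a minimal sketch of the concept, not the paper's VROP implementation: the hashed bag-of-words encoder stands in for a real text encoder, and the function names and the 0.2 threshold are illustrative assumptions.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding; a stand-in for a real text encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return float(a @ b)

def is_orthogonal(gadget: str, harmful_intent: str, threshold: float = 0.2) -> bool:
    """Keep a gadget only if its similarity to the harmful intent is low,
    so no individual input reveals the malicious goal on its own."""
    return cosine(embed(gadget), embed(harmful_intent)) < threshold

def select_gadgets(candidates: list[str], harmful_intent: str) -> list[str]:
    """Filter candidate benign descriptions down to semantically orthogonal gadgets."""
    return [g for g in candidates if is_orthogonal(g, harmful_intent)]

if __name__ == "__main__":
    intent = "assemble the forbidden device"             # placeholder harmful goal
    candidates = [
        "a photo of hardware tools on a workbench",      # benign in isolation: kept
        "a diagram of a household electronic timer",     # benign in isolation: kept
        "instructions for the forbidden device",         # overlaps the intent: rejected
    ]
    print(select_gadgets(candidates, intent))
```

In the actual framework, such a check would presumably be scored against embeddings from the target or a surrogate model rather than a toy encoder; the point here is only that each retained gadget looks benign when inspected in isolation.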

🔍 Key Points

  • Introduction of the Reasoning-Oriented Programming (ROP) paradigm, which draws parallels between software security exploits and adversarial attacks on large vision-language models (LVLMs).
  • Development of an automated attack framework, VROP, which generates semantically benign inputs that elicit harmful outputs from LVLMs by exploiting their compositional reasoning abilities (a toy layout sketch follows this list).
  • Demonstration of significant attack success rates that outperform existing attack baselines in both open-source and commercial settings, exposing vulnerabilities in current safety alignment techniques.
  • Evaluation results showing that VROP is effective across multiple state-of-the-art LVLMs, indicating the attack's robustness across differing safety alignment strategies.
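The abstract also notes that visual gadgets are arranged to prevent premature feature fusion. The sketch below shows one plausible way such spatial isolation could look on a composite canvas; the grid layout, margin width, and function name are assumptions made for illustration, not the paper's actual layout procedure.

```python
from PIL import Image

def compose_spatially_isolated(gadget_images: list[Image.Image],
                               cell: int = 512, margin: int = 64) -> Image.Image:
    """Place each visual gadget in its own grid cell, separated by wide white
    margins, so the inputs stay spatially isolated on the composite canvas."""
    canvas = Image.new("RGB", (len(gadget_images) * cell, cell), "white")
    for i, img in enumerate(gadget_images):
        thumb = img.copy()
        thumb.thumbnail((cell - 2 * margin, cell - 2 * margin))  # fit inside the cell
        canvas.paste(thumb, (i * cell + margin, margin))
    return canvas

if __name__ == "__main__":
    # Solid-color tiles stand in for generated visual gadgets.
    gadgets = [Image.new("RGB", (256, 256), c) for c in ("lightblue", "lightgreen", "salmon")]
    compose_spatially_isolated(gadgets).save("composite.png")
```

The design intuition, as described in the abstract, is that keeping the gadgets visually separated delays feature fusion until the model's own reasoning composes them.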

💡 Why This Paper Matters

This paper reveals critical vulnerabilities in large vision-language models that stem from their compositional reasoning capabilities, demonstrating the effectiveness of the VROP attack framework. Its findings underscore the need for safety alignment strategies that can handle subtle, composition-based adversarial input synthesis, making it a significant contribution to AI safety and security.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers because it addresses a largely underexplored class of adversarial attacks that target compositional reasoning in LVLMs. The introduction of VROP provides new insight into how LVLMs can be manipulated and highlights the limitations of current safety measures. Researchers can draw on this work to understand the dynamics of such attacks and to develop more robust defenses against these emerging threats.

📚 Read the Full Paper