
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Authors: Jiwei Guan, Haibo Jin, Haohan Wang

Published: 2026-01-05

arXiv ID: 2601.01747v1

Added to Library: 2026-01-07 10:02 UTC

Red Teaming

πŸ“„ Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model access, incur high computational costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation through input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
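
The core idea of the approach is a query-only gradient estimate: SPSA approximates the gradient of a loss using nothing but input-output interactions with the target model. The sketch below illustrates this style of zeroth-order estimation; the function names, perturbation scale, and sample count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spsa_gradient_estimate(loss_fn, x, delta=1e-3, num_samples=8):
    """Zeroth-order gradient estimate via Simultaneous Perturbation
    Stochastic Approximation (SPSA).

    loss_fn: black-box scalar loss, evaluated only through queries
             (e.g., a jailbreak objective scored from model outputs).
    x:       current adversarial image, flattened to a 1-D float array.
    """
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        # Rademacher (+/-1) perturbation direction, one entry per coordinate.
        direction = np.random.choice([-1.0, 1.0], size=x.shape)
        # Two loss queries per sample: a forward and a backward probe.
        loss_plus = loss_fn(x + delta * direction)
        loss_minus = loss_fn(x - delta * direction)
        # Central-difference estimate along the random direction.
        grad += (loss_plus - loss_minus) / (2.0 * delta) * direction
    return grad / num_samples
```

Each sample costs only two queries regardless of input dimensionality, which is what keeps the memory and compute footprint low compared with backpropagating through the full model.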

πŸ” Key Points

  • Introduced ZO-SPSA, a novel black-box adversarial attack method designed specifically for Large Vision-Language Models (LVLMs), enabling effective attacks without any model access (a sketch of such a query-only attack loop follows this list).
  • Achieved the highest attack success rate (ASR) of 83.0% on InstructBLIP, showcasing the effectiveness of black-box attacks compared to current white-box techniques.
  • Demonstrated strong transferability of adversarial examples across different LVLMs, achieving a 64.18% ASR from MiniGPT-4 to other models, emphasizing vulnerabilities in model safety mechanisms.
  • Lowered computational requirements and GPU memory consumption significantly while maintaining attack effectiveness, making the method practical for real-world applications.
  • Evaluated adversarial inputs across multiple toxicity dimensions, revealing substantial increases in the probability of generating harmful responses, thus exposing critical safety flaws.
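
To make the points above concrete, here is a minimal sketch of a query-only attack loop that combines the SPSA estimate above with an L-infinity projection to keep the perturbation imperceptible. The step rule, epsilon budget, and iteration count are assumptions for illustration and do not reflect the paper's actual configuration.

```python
import numpy as np

def zo_attack(loss_fn, image, epsilon=8 / 255, step_size=1 / 255, iterations=500):
    """Illustrative black-box attack loop: descend a zeroth-order gradient
    estimate while projecting back onto an L-infinity ball around the
    clean image so the perturbation stays visually imperceptible."""
    x = image.astype(np.float64).copy()
    for _ in range(iterations):
        # Query-only gradient estimate (see the SPSA sketch above).
        grad = spsa_gradient_estimate(loss_fn, x)
        # Signed descent step on the jailbreak loss.
        x = x - step_size * np.sign(grad)
        # Project into the epsilon-ball and the valid pixel range.
        x = np.clip(x, image - epsilon, image + epsilon)
        x = np.clip(x, 0.0, 1.0)
    return x
```

Because the loop only ever calls `loss_fn`, the same adversarial image can be crafted against one model and then replayed against others, which is the transfer setting the paper evaluates.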

πŸ’‘ Why This Paper Matters

This paper’s introduction of the ZO-SPSA methodology is crucial for understanding the vulnerabilities of LVLMs to adversarial attacks, especially in real-world scenarios where model access is restricted. By demonstrating that adversarial attacks can bypass existing safety mechanisms, this research highlights the importance of robust safety protocols in the deployment of large-scale AI models.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are particularly significant for AI security researchers because they expose underlying vulnerabilities in the safety mechanisms of widely used LVLMs. As these models become increasingly integrated into applications, understanding and mitigating adversarial risks is essential. The proposed methods not only advance the field of adversarial machine learning but also have practical implications for developing more secure AI systems that can withstand such attacks.

πŸ“š Read the Full Paper