
An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Authors: Zhi Luo, Zenghui Yuan, Wenqi Wei, Daizong Liu, Pan Zhou

Published: 2025-11-20

arXiv ID: 2511.16163v1

Added to Library: 2025-11-21 03:04 UTC

Red Teaming

📄 Abstract

With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output; they fail to treat output token length as an explicit optimization objective to be maximized directly, and therefore lack stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) that injects imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings and uses them to maximize the number of tokens generated for the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
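The abstract's second stage, vision-aligned perturbation optimization, amounts to maximizing the similarity between the perturbed image's visual embedding and the adversarial prompt's embedding under an imperceptibility budget. The sketch below illustrates that idea in PyTorch; it is not the authors' implementation. `image_encoder`, `target_prompt_embedding` (both assumed to yield pooled `[B, D]` embeddings), the L∞ budget `epsilon`, and the PGD step size are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def vision_aligned_perturbation(image, image_encoder, target_prompt_embedding,
                                epsilon=8 / 255, alpha=1 / 255, steps=100):
    """PGD-style sketch: nudge the image so its visual embedding moves toward
    the embedding of the verbosity-inducing adversarial prompt."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        visual_emb = image_encoder(image + delta)  # assumed pooled [B, D] features
        similarity = F.cosine_similarity(visual_emb, target_prompt_embedding, dim=-1).mean()
        similarity.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                 # gradient *ascent* on similarity
            delta.clamp_(-epsilon, epsilon)                    # L_inf budget keeps the change imperceptible
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep the perturbed image in [0, 1]
            delta.grad.zero_()
    return (image + delta).detach()
```

The sign-based ascent with an L∞ clamp is the standard PGD recipe; the paper's exact loss, budget, and optimizer may differ.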

🔍 Key Points

  • Introduction of Verbose-Text Induction Attack (VTIA) targeting Vision-Language Models (VLMs) to increase output token lengths through controlled image perturbations.
  • Development of a two-stage attack framework employing reinforcement learning for adversarial prompt search followed by vision-aligned perturbation optimization, enabling effective and stable control over generated verbosity (a hedged sketch of the search-stage reward follows this list).
  • Demonstration of significant advantages over previous methods, achieving up to 121.90× longer outputs across popular VLMs while maintaining visual imperceptibility of adversarial inputs, thus posing a substantial security risk.
  • Comprehensive experimentation on four popular VLMs (BLIP-2, InstructBLIP, LLaVA, Qwen2-VL) showing that VTIA consistently outperforms baseline methods, underscoring the energy and cost risks such attacks pose to VLM deployments.
  • Detailed ablation studies evaluating the impact of individual components and settings of the attack, contributing valuable insights into adversarial image perturbation strategies.
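The prompt-search stage needs a scalar reward the reinforcement-learning policy can maximize. A natural reading of the abstract is "number of tokens the LLM component generates before emitting EOS"; the sketch below shows such a reward under that assumption, with a generic Hugging Face causal LM standing in for a VLM's language backbone. The model name in the usage comment is illustrative only, not taken from the paper.

```python
import torch

def verbosity_reward(prompt: str, model, tokenizer, max_new_tokens: int = 1024) -> int:
    """Reward = number of new tokens the language backbone generates before it stops.
    A candidate adversarial prompt that delays the EOS token longer earns a higher reward."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # generate() returns prompt + continuation for causal LMs, so subtract the prompt length
    return output_ids.shape[1] - inputs["input_ids"].shape[1]

# Illustrative usage (model choice is an assumption, not specified by this summary):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", device_map="auto")
# print(verbosity_reward("Describe in exhaustive detail ...", model, tokenizer))
```

In the paper's framework this score would be fed back to the RL policy proposing candidate prompts; the actual reward shaping and search procedure are detailed in the full paper.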

💡 Why This Paper Matters

This paper presents crucial findings regarding the vulnerabilities of Vision-Language Models (VLMs) to verbose-text induction attacks, highlighting how adversarial inputs can undermine deployment efficiency. With the increasing integration of VLMs into applications that demand efficiency, understanding and mitigating such attacks has become essential for developers and researchers alike.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it elucidates a novel attack vector targeting the efficiency of VLMs, shedding light on a critical aspect of adversarial machine learning. By demonstrating practical exploitation methods that induce excessive token generation, it raises awareness of the energy consumption and operational costs associated with AI deployments and motivates the development of more robust defensive measures.

📚 Read the Full Paper