
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Authors: Weikai Lu, Ziqian Zeng, Kehua Zhang, Haoran Li, Huiping Zhuang, Ruidong Wang, Cen Chen, Hao Peng

Published: 2025-12-05

arXiv ID: 2512.05745v1

Added to Library: 2025-12-08 03:00 UTC

📄 Abstract

Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, modality-independent defense can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also find that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that is decoupled from the utility-degradation direction, and further combines adaptive-strength steering to achieve a better safety-utility trade-off. ARGUS also introduces a lightweight injection-detection stage to activate the defense on demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS achieves robust defense against multimodal IPI while maximally preserving the MLLM's utility.
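
The core mechanism described in the abstract is activation steering: adding a defense direction to the model's hidden states to push the model back toward following the user's instruction. The sketch below illustrates the general idea on a toy PyTorch model using a forward hook; the layer choice, the random direction, and the norm-proportional strength rule are placeholder assumptions for illustration, not the paper's actual procedure.

```python
# Minimal sketch of activation steering: add a fixed "defense direction" to a
# layer's hidden states, scaled by an adaptive strength. Illustrative only --
# the hooked layer, the direction, and the strength rule are stand-ins.
import torch
import torch.nn as nn

hidden = 64
model = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(4)])

defense_dir = torch.randn(hidden)
defense_dir = defense_dir / defense_dir.norm()  # unit-norm steering direction

def steering_hook(module, inputs, output):
    # Adaptive strength (hypothetical rule): scale the intervention to the
    # activation's own norm so strong activations are not over-steered.
    alpha = 0.1 * output.norm(dim=-1, keepdim=True)
    return output + alpha * defense_dir

# Intervene at a single intermediate layer (layer index is an assumption).
handle = model[2].register_forward_hook(steering_hook)

x = torch.randn(8, hidden)   # stand-in for token hidden states
steered = model(x)           # forward pass now includes the steering
handle.remove()              # detach the hook when the defense is off
```

Because the hook can be attached and removed at will, this style of intervention pairs naturally with an on-demand trigger such as the injection-detection stage the abstract describes.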

🔍 Key Points

  • Empirical finding that the instruction-following behavior of MLLMs is encoded in a subspace of the representation space, and that steering along directions within this subspace enforces adherence to user instructions regardless of the attack modality.
  • Observation that a naive defense direction can be coupled with a utility-degrading direction, and that excessive intervention strength harms model performance.
  • ARGUS searches the safety subspace for an optimal defense direction that is decoupled from the utility-degradation direction and applies adaptive-strength steering to improve the safety-utility trade-off (see the decoupling sketch after this list).
  • A lightweight injection-detection stage activates the defense on demand, and a post-filtering stage verifies that the defense succeeded.
  • Experimental results show robust defense against multimodal indirect prompt injection (IPI) while largely preserving the MLLM's utility.
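
The abstract notes that a naive defense direction can be coupled with a utility-degrading direction. One simple way to decouple two directions, shown below purely as an illustration, is to project the utility-degrading component out of the candidate defense direction; ARGUS itself performs a search within the safety subspace, which this sketch does not reproduce.

```python
# Sketch of decoupling: remove the component of a candidate defense direction
# that lies along a known utility-degrading direction, so steering along the
# result leaves that axis untouched. A simple projection, not ARGUS's search.
import torch

def decouple(defense_dir: torch.Tensor, utility_dir: torch.Tensor) -> torch.Tensor:
    u = utility_dir / utility_dir.norm()
    # Subtract the projection of the defense direction onto the utility axis.
    d = defense_dir - (defense_dir @ u) * u
    return d / d.norm()

# Toy usage with random stand-in directions (real ones would be estimated
# from model activations, e.g. contrasting injected vs. clean inputs).
d_raw = torch.randn(64)
u_deg = torch.randn(64)
d_safe = decouple(d_raw, u_deg)
print(torch.dot(d_safe, u_deg / u_deg.norm()))  # ~0: orthogonal to utility axis
```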

💡 Why This Paper Matters

ARGUS targets a gap left by existing indirect prompt injection defenses, which were designed for text-only LLMs and are easily bypassed, modality-dependent, or generalize poorly to multimodal inputs. By steering the model's behavior in the representation space rather than filtering any single input modality, ARGUS defends against malicious instructions whether they arrive through images, video, or audio, while its decoupled defense direction, adaptive steering strength, on-demand activation, and post-filtering stages are designed to preserve the model's utility. This matters as MLLMs increasingly process untrusted multimodal content in real applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant to AI security researchers because it shows that the instruction-following behavior of MLLMs can be located and steered in the representation space, turning interpretability-style findings (a safety subspace, and the coupling between defense and utility-degradation directions) into a practical, modality-independent defense against indirect prompt injection. Its pipeline of lightweight injection detection, adaptive-strength steering, and post-filtering also provides a concrete baseline against which future multimodal IPI attacks and adaptive adversaries can be evaluated.

📚 Read the Full Paper