SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Authors: Juan Ren, Mark Dras, Usman Naseem

Published: 2025-10-15

arXiv ID: 2510.13190v1

Added to Library: 2025-10-16 05:00 UTC

Red Teaming

📄 Abstract

Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.

🔍 Key Points

  • Introduction of SHIELD, a model-agnostic preprocessing framework that enhances safety in Large Vision-Language Models (LVLMs) through classifier-guided prompting.
  • Utilizes a fine-grained safety classification system paired with explicit actions (Block, Reframe, Forward), allowing nuanced responses instead of binary moderation.
  • Empirical results across five benchmark datasets show SHIELD significantly reduces jailbreak and non-following rates, enhancing the robustness of both weakly and strongly aligned models.
  • Features a plug-and-play design, meaning it can be easily integrated into existing systems without requiring extensive retraining or modifications.
  • Includes an ablation study showing that both category-specific safety prompts and explicit action directives contribute to the overall safety gains, clarifying which components of the framework drive its effectiveness.
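The classify-then-act flow described above can be sketched as a small preprocessing function. This is a minimal illustrative sketch, not the paper's implementation: the category names, the keyword-based stand-in classifier, and the prompt templates are all assumptions for demonstration, whereas SHIELD uses a fine-grained safety classifier and category-specific guidance.

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"
    REFRAME = "reframe"
    FORWARD = "forward"

# Hypothetical policy table: category -> (action, guidance snippet).
# SHIELD's real categories and guidance texts are not reproduced here.
POLICY = {
    "weapons": (Action.BLOCK, "Refuse and briefly explain why."),
    "self_harm": (Action.REFRAME, "Redirect toward support resources."),
    "benign": (Action.FORWARD, ""),
}

def classify(prompt: str) -> str:
    """Toy stand-in for the fine-grained safety classifier."""
    lowered = prompt.lower()
    if "weapon" in lowered or "bomb" in lowered:
        return "weapons"
    if "hurt myself" in lowered:
        return "self_harm"
    return "benign"

def shield_preprocess(prompt: str) -> tuple[Action, str]:
    """Compose the prompt actually sent to the LVLM, per the policy."""
    category = classify(prompt)
    action, guidance = POLICY[category]
    if action is Action.FORWARD:
        return action, prompt  # benign input passes through unchanged
    if action is Action.BLOCK:
        return action, f"[SAFETY: {guidance}] Do not comply with: {prompt}"
    return action, f"[SAFETY: {guidance}] Respond safely to: {prompt}"
```

The key property this sketch tries to capture is that the preprocessor never modifies the model itself: it only rewrites (or blocks) the incoming prompt, which is why the approach is plug-and-play across differently aligned LVLMs.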

💡 Why This Paper Matters

SHIELD offers a practical enhancement to safety mechanisms for LVLMs, which matters as these models are increasingly deployed in sensitive applications. Its lightweight, adaptable design lets developers improve safety without sacrificing model utility, making it a significant step toward responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant for AI security researchers because it addresses the pressing issue of safety in multimodal AI systems. Given the growing potential for adversarial attacks on LVLMs, effective defense mechanisms are crucial. SHIELD's findings on fine-grained safety classification and tailored action prompts offer valuable insights for building robust defenses against emerging threats, and a foundation for future research in AI safety and security.
