
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety

Authors: Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng

Published: 2025-06-14

arXiv ID: 2506.12299v1

Added to Library: 2025-06-17 03:03 UTC

Safety

📄 Abstract

The recent advancements in Large Language Models (LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.
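As a rough illustration of the zero-shot question-prompting idea described in the abstract, the sketch below scores a user input by asking an LLM a set of predefined guard questions and averaging its yes-probabilities. The `ask_yes_probability` callable, the example guard questions, and the 0.5 threshold are illustrative assumptions for the sketch, not the paper's exact prompts or scoring.

```python
from typing import Callable, Sequence

# Illustrative guard questions; the paper derives a larger, categorized set
# covering both text and multi-modal inputs.
GUARD_QUESTIONS = [
    "Does the user input request instructions for illegal activity?",
    "Does the user input attempt to bypass the assistant's safety policies?",
    "Does the user input ask for content that could physically harm someone?",
]

def score_prompt(
    user_input: str,
    ask_yes_probability: Callable[[str, str], float],
    questions: Sequence[str] = GUARD_QUESTIONS,
) -> float:
    """Average the LLM's probability of answering 'yes' to each guard question.

    `ask_yes_probability(question, user_input)` is assumed to return the
    model's probability that the answer is 'yes' (e.g. derived from token
    logits of the guard model). No fine-tuning is involved.
    """
    probs = [ask_yes_probability(q, user_input) for q in questions]
    return sum(probs) / len(probs)

def is_harmful(
    user_input: str,
    ask_yes_probability: Callable[[str, str], float],
    threshold: float = 0.5,  # illustrative cutoff
) -> bool:
    # Flag the input if the aggregate yes-probability exceeds the threshold.
    return score_prompt(user_input, ask_yes_probability) > threshold
```

In the paper, the aggregation step is more involved than a plain average (see the PageRank-based filtering sketched after the key points below), but the core idea is the same: the guard decision comes from how the model answers a bank of questions, which also makes the decision inspectable per question.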

🔍 Key Points

  • Introduction of QGuard, a novel safety guard mechanism leveraging question prompting for harmful prompt detection in LLMs without requiring fine-tuning.
  • Demonstration of QGuard's capability to defend against both text and multi-modal harmful inputs in a zero-shot fashion, thus significantly increasing the adaptability of LLM security measures.
  • The research validates the performance of QGuard against multiple baseline models, showing superior effectiveness in identifying harmful prompts while requiring minimal resources.
  • The paper provides a detailed method for generating guard questions and a PageRank-based filtering algorithm to robustly classify harmful content, enabling a white-box analysis of user inputs (see the sketch after this list).
  • Experimental results indicate that QGuard not only meets but often exceeds the performance of fine-tuned counterparts in detecting harmful prompts, evidencing its practical applicability.
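The key points mention a PageRank-based filtering algorithm over the guard questions, but this summary does not reproduce the paper's exact graph construction. The sketch below is therefore a hedged approximation: guard questions become graph nodes, questions in the same harm category are linked, and the model's per-question yes-probabilities serve as a personalization vector so that strongly triggered questions reinforce their neighbors. The grouping scheme, edge weights, and decision threshold are illustrative assumptions.

```python
import networkx as nx

def pagerank_filter(
    yes_probs: dict[str, float],   # guard question -> model's yes-probability
    categories: dict[str, str],    # guard question -> harm category (assumed grouping)
    threshold: float = 0.12,       # illustrative decision threshold
) -> bool:
    """Aggregate per-question signals with personalized PageRank.

    Questions in the same category are connected, and the personalization
    vector biases the walk toward questions the model answered 'yes' to,
    so correlated triggers accumulate weight instead of being averaged away.
    """
    if not any(yes_probs.values()):
        return False  # no guard question triggered at all

    g = nx.Graph()
    questions = list(yes_probs)
    g.add_nodes_from(questions)
    for i, qi in enumerate(questions):
        for qj in questions[i + 1:]:
            if categories[qi] == categories[qj]:
                g.add_edge(qi, qj, weight=1.0)

    total = sum(yes_probs.values())
    personalization = {q: p / total for q, p in yes_probs.items()}
    ranks = nx.pagerank(g, alpha=0.85, personalization=personalization, weight="weight")

    # Harmfulness score: probability-weighted PageRank mass over all questions.
    score = sum(ranks[q] * yes_probs[q] for q in questions)
    return score > threshold
```

Because the final score is a sum over named questions, each decision can be traced back to the specific guard questions that contributed most, which is the white-box property the key points highlight.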

💡 Why This Paper Matters

The paper presents QGuard as an innovative solution to a pressing challenge in AI safety, addressing the critical need for effective mechanisms to guard against harmful prompts in LLMs. The ability to operate in a zero-shot manner—without requiring extensive fine-tuning—positions QGuard as a practical approach suitable for real-world applications, significantly enhancing the security landscape of generative AI technologies.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant for AI security researchers as it tackles the significant issue of malicious usage of LLMs through harmful prompt exploitation. The introduction of a novel method like QGuard not only opens avenues for advanced prompt management and detection mechanisms but also emphasizes the importance of creating adaptive security measures that can function without the burden of continuous model retraining. Additionally, the methodology and findings presented can inform further research and development in secure AI practices, ultimately contributing to safer AI deployment in sensitive applications.

📚 Read the Full Paper