Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

📄 Abstract

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

🔍 Key Points

Introduction of the LLM-as-a-Judge approach, which utilizes structured reasoning to enhance prompt-attack detection capabilities in LLM systems.
Evaluation of a curated dataset that combines real-world benign queries and adversarial prompts to assess the effectiveness of the proposed method.
Demonstration that lightweight, general-purpose LLMs can operate effectively as low-latency judges for real-time guardrails, outperforming both traditional classifiers and specialized models.
Investigating a Mixture-of-Models approach to aggregate judgments from multiple LLMs, revealing modest improvements and emphasizing careful model selection for effectiveness.
Findings underscore the importance of prompt design and structured outputs in enhancing the reliability and accuracy of attack detection in conversational AI.

💡 Why This Paper Matters

This paper significantly contributes to the field of AI security by providing a practical framework for detecting prompt attacks in LLMs using a novel LLM-as-a-Judge methodology. This approach not only demonstrates the potential of lightweight models in production environments but also highlights the critical role of structured reasoning and prompt design in improving detection accuracy. Overall, this research addresses key challenges in real-world LLM deployments, making it highly relevant for developers and researchers focused on enhancing the robustness of AI systems against adversarial exploitation.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it tackles the pressing issue of prompt attacks on LLMs, which pose significant risks in operational settings. The introduction of the LLM-as-a-Judge approach, combined with the empirical evaluation of a practical dataset, offers insights into real-world defenses against adversarial attacks. Additionally, the results on the Mixture-of-Models method provide a nuanced understanding of model ensemble strategies in enhancing security, making this work a valuable resource for those investigating advanced defenses in AI applications.

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper