CourtGuard: A Local, Multiagent Prompt Injection Classifier

Authors: Isaac Wu, Michael Maslowski

Published: 2025-10-20

arXiv ID: 2510.19844v1

Added to Library: 2025-11-11 14:35 UTC

Red Teaming

📄 Abstract

As large language models (LLMs) become integrated into various sensitive applications, prompt injection, the use of prompting to induce harmful behaviors from LLMs, poses an ever-increasing risk. Prompt injection attacks can cause LLMs to leak sensitive data, spread misinformation, and exhibit harmful behaviors. To defend against these attacks, we propose CourtGuard, a locally runnable, multiagent prompt injection classifier. In it, prompts are evaluated in a court-like multiagent LLM system, where a "defense attorney" model argues that the prompt is benign, a "prosecution attorney" model argues that the prompt is a prompt injection, and a "judge" model gives the final classification. CourtGuard has a lower false positive rate than the Direct Detector, an LLM-as-a-judge baseline. However, CourtGuard is generally a worse prompt injection detector. Nevertheless, its lower false positive rate highlights the importance of considering both adversarial and benign scenarios when classifying a prompt. Additionally, the relative performance of CourtGuard in comparison to other prompt injection classifiers advances the use of multiagent systems as a defense against prompt injection attacks. The implementations of CourtGuard and the Direct Detector, with full prompts for Gemma-3-12b-it, Llama-3.3-8B, and Phi-4-mini-instruct, are available at https://github.com/isaacwu2000/CourtGuard.
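The court-style pipeline the abstract describes is straightforward to sketch. The following is a minimal illustration rather than the authors' implementation: the `generate` helper, the role prompts, and the verdict parsing are hypothetical stand-ins, and the real prompts and model wiring live in the linked repository.

```python
# Minimal sketch of a court-style multiagent classifier.
# generate(system, user) is a hypothetical wrapper around a locally
# hosted LLM (e.g., Gemma-3-12b-it behind any local inference API);
# the role prompts below are illustrative, not the paper's.

def generate(system: str, user: str) -> str:
    """Stand-in for a single call to a local LLM; wire in your backend."""
    raise NotImplementedError

def classify_prompt(prompt: str) -> str:
    # "Defense attorney": argues the prompt is benign.
    defense = generate(
        "You are a defense attorney. Argue that the user prompt below "
        "is benign and contains no prompt injection.",
        prompt,
    )
    # "Prosecution attorney": argues the prompt is a prompt injection.
    prosecution = generate(
        "You are a prosecuting attorney. Argue that the user prompt "
        "below is a prompt injection attack.",
        prompt,
    )
    # "Judge": weighs both arguments and issues the final classification.
    verdict = generate(
        "You are a judge. Given both arguments, answer with exactly one "
        "word: BENIGN or INJECTION.",
        f"Prompt:\n{prompt}\n\nDefense:\n{defense}\n\n"
        f"Prosecution:\n{prosecution}",
    )
    return "injection" if "INJECTION" in verdict.upper() else "benign"
```

Forcing the judge to see an explicit argument for each side is what distinguishes this design from a single LLM-as-a-judge call, and it is the mechanism the paper credits for the lower false positive rate.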

🔍 Key Points

  • Introduces CourtGuard, a multiagent prompt injection classifier built on a court-like framework with defense attorney, prosecution attorney, and judge roles.
  • Achieves a lower false positive rate than the Direct Detector, highlighting the trade-off between detecting adversarial prompts and correctly passing benign ones.
  • Shows that CourtGuard, while less effective than the Direct Detector at detecting prompt injections, classifies benign prompts more accurately, which matters for real-world applications handling sensitive data.
  • Surveys existing prompt injection detection methods and argues that multiagent systems like CourtGuard can advance defenses against prompt injection attacks.
  • Presents an empirical analysis across three LLM backbones (Gemma, Llama, Phi), comparing performance on datasets such as LLMail-Inject, NotInject, and the Qualifire Prompt Injection Benchmark (a minimal evaluation sketch follows this list).
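The two headline metrics in that comparison, detection rate on known-injection prompts and false positive rate on known-benign prompts, can be computed as below. This is a minimal sketch assuming the `classify_prompt` function from the earlier example; the prompt lists are placeholders for corpora such as NotInject or LLMail-Inject, whose loading is not shown.

```python
# Sketch of the evaluation described above. Assumes classify_prompt()
# from the previous sketch; the input lists stand in for labeled
# benchmark datasets (e.g., NotInject for benign prompts).

def detection_rate(injection_prompts: list[str]) -> float:
    """Fraction of known-injection prompts flagged as injections."""
    hits = sum(classify_prompt(p) == "injection" for p in injection_prompts)
    return hits / len(injection_prompts)

def false_positive_rate(benign_prompts: list[str]) -> float:
    """Fraction of known-benign prompts wrongly flagged as injections."""
    fps = sum(classify_prompt(p) == "injection" for p in benign_prompts)
    return fps / len(benign_prompts)
```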

💡 Why This Paper Matters

The paper advances prompt injection detection through the CourtGuard system, showing that multiagent frameworks can improve the security of large language models in sensitive applications. By lowering the false positive rate on benign inputs, it carries practical implications for enterprises handling confidential information and reinforces the need for robust defenses against adversarial attacks on AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper matters to AI security researchers because it addresses the rapidly growing threat of prompt injection attacks, which undermine the integrity and safety of applications built on large language models. Its approach of using a multiagent system to classify prompts offers new insights and methods for defending against sophisticated injection techniques, and the empirical comparisons it provides can serve as benchmarks for future research in the field.

📚 Read the Full Paper