Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

Authors: Narek Maloyan, Dmitry Namiot

Published: 2025-04-25

arXiv ID: 2504.18333v1

Added to Library: 2025-11-11 14:18 UTC

Red Teaming

📄 Abstract

LLM as judge systems used to assess text quality code correctness and argument strength are vulnerable to prompt injection attacks. We introduce a framework that separates content author attacks from system prompt attacks and evaluate five models Gemma 3.27B Gemma 3.4B Llama 3.2 3B GPT 4 and Claude 3 Opus on four tasks with various defenses using fifty prompts per condition. Attacks achieved up to seventy three point eight percent success smaller models proved more vulnerable and transferability ranged from fifty point five to sixty two point six percent. Our results contrast with Universal Prompt Injection and AdvPrompter We recommend multi model committees and comparative scoring and release all code and datasets

🔍 Key Points

Introduced a comprehensive framework for analyzing adversarial attacks on LLM-as-a-judge systems, distinguishing between content-author and system-prompt attacks.
Evaluated five LLM models and multiple attack variants, demonstrating that sophisticated attacks can achieve up to 73.8% success rates, particularly against smaller models.
Demonstrated the effectiveness of multi-model committees as a robust defense mechanism, reducing attack success rates significantly (by up to 47 points) when compared to individual models.
Provided rigorous experimental validations across diverse evaluation tasks, revealing significant differences in model vulnerability based on architecture and task type.
Proposed novel attack strategies, including the Adaptive Search-Based Attack, which outperformed existing methods like Universal-Prompt-Injection and AdvPrompter.

💡 Why This Paper Matters

This paper provides crucial insights into the vulnerabilities of LLM-as-a-judge systems to adversarial attacks and presents a systematic approach to evaluating and defending against these attacks. Its findings underscore the need for enhanced security measures in deploying AI systems used for high-stakes evaluations, making it a pivotal contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper relevant because it addresses the pressing issue of adversarial vulnerabilities in powerful LLMs, which are increasingly used in sensitive applications. The paper not only identifies critical weaknesses but also presents novel methodologies and defenses for improving the safety and reliability of LLM-as-a-judge systems, fostering a deeper understanding of adversarial mechanisms and effective countermeasures in AI.

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper