
Optimization-based Prompt Injection Attack to LLM-as-a-Judge

Authors: Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong

Published: 2024-03-26

arXiv ID: 2403.17710v5

Added to Library: 2025-11-11 14:20 UTC

Red Teaming

📄 Abstract

LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question, no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences, as well as jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show that these defenses are insufficient, highlighting the urgent need for developing new defense strategies. Our implementation is available at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.
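To make the attack surface concrete, here is a minimal sketch (not the authors' code) of how an LLM-as-a-Judge prompt is typically assembled and where the injected sequence lands. The template, function names, and example strings are illustrative assumptions, not the paper's API or prompts.

```python
# Illustrative sketch only: shows where an attacker-controlled candidate response
# (carrying an optimized injected sequence) enters the judge prompt.

JUDGE_TEMPLATE = (
    "You are a judge. Given the question and the candidate responses, "
    "output the index of the best response.\n\n"
    "Question: {question}\n\n"
    "{responses}\n\n"
    "Answer with the index of the best response."
)

def build_judge_prompt(question: str, candidates: list[str]) -> str:
    """Render the LLM-as-a-Judge prompt over all candidate responses."""
    rendered = "\n".join(
        f"Response {i + 1}: {resp}" for i, resp in enumerate(candidates)
    )
    return JUDGE_TEMPLATE.format(question=question, responses=rendered)

if __name__ == "__main__":
    question = "Which laptop should I buy for machine learning?"
    # Placeholder: in the attack, these tokens are found by gradient-based optimization.
    injected_sequence = "<optimized adversarial tokens>"
    candidates = [
        "A laptop with a recent NVIDIA GPU and 32 GB of RAM is a solid choice.",
        "Attacker's product description. " + injected_sequence,  # attacker-controlled candidate
    ]
    print(build_judge_prompt(question, candidates))
```

The attacker's goal is that the judge outputs the index of the second candidate regardless of what the other candidates say.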

🔍 Key Points

  • JudgeDeceiver introduces an optimization-based framework for crafting prompt injection attacks specifically targeting LLM-as-a-Judge, distinguishing it from previous manually crafted attacks.
  • The paper formulates a novel optimization problem that generates injected sequences maximizing the likelihood of the attacker-designated response being chosen, combining a target-aligned generation loss, a target-enhancement loss, and an adversarial perplexity loss (see the loss sketch after this list).
  • Extensive evaluations demonstrate that JudgeDeceiver significantly outperforms existing manual prompt injection methods and jailbreak attacks across multiple benchmark datasets and real-world scenarios, achieving attack success rates upwards of 90%.
  • The research identifies critical weaknesses in current defenses against prompt injection, showing that known-answer detection and perplexity-based detection (including its windowed variant) fail to reliably catch JudgeDeceiver (see the defense sketch after this list).
  • Multiple case studies illustrate the practical implications of the attack in applications like LLM-powered search, reinforcement learning from AI feedback (RLAIF), and tool selection, highlighting the vulnerabilities in widely used AI systems.
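The combined objective can be pictured as a weighted sum of the three losses named above. The snippet below is a minimal sketch under stated assumptions, not the paper's implementation: the weights `alpha` and `beta`, the tensor shapes, and the way the target-enhancement term singles out the verdict token are placeholders chosen for illustration.

```python
import torch
import torch.nn.functional as F

def combined_attack_loss(
    target_logits: torch.Tensor,    # (T, V) judge logits over the desired verdict tokens
    target_ids: torch.Tensor,       # (T,) token ids of the attacker-chosen verdict, e.g. "Response 2"
    injected_logits: torch.Tensor,  # (S, V) logits predicting each injected token from its prefix
    injected_ids: torch.Tensor,     # (S,) token ids of the current injected sequence
    alpha: float = 1.0,             # placeholder weights; the paper tunes its own coefficients
    beta: float = 0.1,
) -> torch.Tensor:
    # Target-aligned generation loss: push the judge to generate the attacker-chosen verdict.
    gen_loss = F.cross_entropy(target_logits, target_ids)

    # Target-enhancement loss (schematic): extra weight on the token naming the
    # attacker's response index, approximated here by the first verdict token.
    enhance_loss = F.cross_entropy(target_logits[:1], target_ids[:1])

    # Adversarial perplexity loss: keep the injected sequence fluent so that
    # perplexity-based detectors are less likely to flag it.
    ppl_loss = F.cross_entropy(injected_logits, injected_ids)

    return gen_loss + alpha * enhance_loss + beta * ppl_loss
```

In the paper, the injected sequence is then updated with a gradient-guided search over candidate token replacements; the sketch above only covers the objective such a search would minimize.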
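The perplexity-based defenses evaluated in the paper score how unnatural a candidate response looks to a language model. Below is a rough sketch of windowed perplexity detection; the window size, the threshold, and the assumption that per-token log-probabilities are already available from some scoring model are illustrative, not the paper's settings.

```python
import math

def windowed_perplexity_flags(token_logprobs: list[float],
                              window: int = 10,
                              threshold: float = 6.0) -> list[bool]:
    """Flag every window whose perplexity exceeds a threshold.

    `token_logprobs` are per-token natural-log probabilities of the candidate
    response under any scoring LM; values here are placeholders.
    """
    if not token_logprobs:
        return []
    flags = []
    for start in range(0, max(1, len(token_logprobs) - window + 1)):
        chunk = token_logprobs[start:start + window]
        ppl = math.exp(-sum(chunk) / len(chunk))
        flags.append(ppl > threshold)
    return flags

# A candidate is rejected if any window is flagged; plain perplexity detection is
# the special case where the window spans the whole response.
```

Because JudgeDeceiver's adversarial perplexity loss explicitly keeps the injected tokens fluent, detectors of this kind are exactly the ones the paper finds insufficient.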

💡 Why This Paper Matters

This paper matters because it systematically exposes how vulnerable LLM-integrated applications are to optimization-based prompt injection. JudgeDeceiver both deepens our understanding of attack mechanisms against LLM-as-a-Judge and signals a pressing need for stronger defenses. By demonstrating the efficacy of this attack, the work underscores the importance of securing AI systems as their integration into real-world applications grows.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of interest to AI security researchers because it reveals significant vulnerabilities in large language models, particularly when they act as evaluators. The techniques introduced in JudgeDeceiver represent a sophisticated approach to adversarial manipulation, and they can inform both future defensive measures and further research on AI security. Understanding these attack vectors is crucial for researchers aiming to bolster the resilience of AI systems against such exploits.

📚 Read the Full Paper

https://arxiv.org/abs/2403.17710