← Back to Library

"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Authors: Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun

Published: 2025-11-03

arXiv ID: 2511.01287v1

Added to Library: 2025-11-11 14:26 UTC

Red Teaming

📄 Abstract

With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.

🔍 Key Points

  • Introduction of In-Paper Prompt Injection (IPI) attacks that manipulate AI reviewers through maliciously embedded prompts in scientific papers.
  • Development of two primary attack methods: static attacks with fixed prompts and iterative attacks that refine prompts based on reviewer feedback.
  • Demonstration of robustness of these attacks across various parameters such as position of injection, human ratings, and AI reviewer models, achieving significant increases in evaluation scores.
  • Proposed a detection-based defense that identifies potential IPI attacks, although the study found it could be circumvented by adaptive adversarial strategies.
  • Extensive experimental evaluation revealing deficiencies in current AI reviewer pipelines and highlighting the necessity for stronger defenses against such threats.

💡 Why This Paper Matters

This paper is significant as it addresses the emerging threat of IPI attacks in AI-assisted peer review processes, illustrating how easily academic evaluations can be manipulated. By revealing the vulnerabilities of currently deployed AI models, the authors not only contribute to the field of AI security but also call for a paradigm shift in how we safeguard the integrity of academic peer review, ensuring that it remains objective and trustworthy.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it delves into the novel domain of prompt injection attacks within AI review systems, a growing concern as AI models become integral to high-stakes decision-making processes. This research provides insights into attack methodologies and the effectiveness of defenses, which can inform future studies and developments aimed at enhancing the security and robustness of AI systems.

📚 Read the Full Paper