โ† Back to Library

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Authors: Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar

Published: 2025-12-11

arXiv ID: 2512.10449v1

Added to Library: 2025-12-12 03:00 UTC

Red Teaming

๐Ÿ“„ Abstract

The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

๐Ÿ” Key Points

  • Introduction of the Weighted Adversarial Vulnerability Score (WAVS), a novel metric for quantifying the susceptibility of LLM-based review systems to adversarial manipulation.
  • Development of 15 domain-specific attack strategies that exploit the unique structure and operational logic of LLMs in scientific peer review, showcasing how these models can be influenced by carefully structured inputs.
  • Demonstrated alarming rates of decision manipulation (from Reject to Accept) across various language models, indicating significant vulnerabilities even in advanced models like GPT-5 and Claude Haiku.
  • Curated a dataset of 200 scientific papers designed specifically for testing LLM sensitivity to adversarial attacks, which will be openly available to facilitate further research in this area.
  • Provided insights into ethical implications and the potential for eroding trust in automated review processes, calling for robust defenses against such vulnerabilities.

๐Ÿ’ก Why This Paper Matters

This paper highlights critical vulnerabilities in LLMs used for scientific peer review, showing that adversarial manipulations can significantly undermine the integrity of the review process. By presenting a structured approach to assess these vulnerabilities, including the introduction of the WAVS metric and various sophisticated attack strategies, the study illuminates the urgent need for better safeguards in automated scientific evaluation systems. The findings are relevant not only for developers of LLM systems but also for stakeholders in academic publishing.

๐ŸŽฏ Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting due to its focus on adversarial attacks within a specific and high-stakes domainโ€”scientific peer review. The detailed analysis of model vulnerabilities exposes real-world implications of LLM behavior when faced with adversarial inputs, providing a critical case study that can inform the development of more robust AI systems and highlight potential areas for defensive research in AI security.

๐Ÿ“š Read the Full Paper