← Back to Library

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Authors: Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Jahnvi Singh, Vinay Chamola, Yash Sinha, Murari Mandal, Dhruv Kumar

Published: 2025-12-11

arXiv ID: 2512.10449v3

Added to Library: 2026-01-07 10:13 UTC

Red Teaming

📄 Abstract

Driven by surging submission volumes, scientific peer review has catalyzed two parallel trends: individual over-reliance on LLMs and institutional AI-powered assessment systems. This study investigates the robustness of "LLM-as-a-Judge" systems to adversarial PDF manipulation via invisible text injections and layout aware encoding attacks. We specifically target the distinct incentive of flipping "Reject" decisions to "Accept," a vulnerability that fundamentally compromises scientific integrity. To measure this, we introduce the Weighted Adversarial Vulnerability Score (WAVS), a novel metric that quantifies susceptibility by weighting score inflation against the severity of decision shifts relative to ground truth. We adapt 15 domain-specific attack strategies, ranging from semantic persuasion to cognitive obfuscation, and evaluate them across 13 diverse language models (including GPT-5 and DeepSeek) using a curated dataset of 200 official and real-world accepted and rejected submissions (e.g., ICLR OpenReview). Our results demonstrate that obfuscation techniques like "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" successfully manipulate scores, achieving decision flip rates of up to 86.26% in open-source models, while exposing distinct "reasoning traps" in proprietary systems. We release our complete dataset and injection framework to facilitate further research on the topic (https://anonymous.4open.sciencer/llm-jailbreak-FC9E/).

🔍 Key Points

  • The study introduces a novel metric, the Weighted Adversarial Vulnerability Score (WAVS), which quantifies the vulnerability of LLM-based reviewers by assessing the severity of decision flips in relation to score inflation.
  • The authors adapt 15 domain-specific attack strategies, including cognitive obfuscation and layout-aware encoding attacks, to specifically target the peer review context, highlighting the unique vulnerabilities present in 'LLM-as-a-Judge' systems.
  • Experiments demonstrate that certain obfuscation techniques can lead to decision flip rates of up to 86.26%, significantly compromising the integrity of the scientific review process by allowing manipulated papers to achieve 'Accept' decisions.
  • The paper provides a comprehensive robustness analysis of 13 language models against these attack strategies, revealing distinct vulnerabilities across models, with proprietary systems exhibiting varying levels of robustness compared to open-source alternatives.
  • The authors emphasize the ethical implications of their findings, urging immediate action to improve the robustness and integrity of LLM-enabled review systems, thus contributing to ongoing discussions about AI safety and academic integrity.

💡 Why This Paper Matters

This paper is significant as it uncovers critical vulnerabilities in automated scientific review systems powered by LLMs, providing a quantitative framework for understanding and mitigating these risks. The introduction of WAVS as a metric fosters a better understanding of the impact of adversarial manipulations on the integrity of scientific dissemination.

🎯 Why It's Interesting for AI Security Researchers

The research is highly relevant to AI security researchers as it directly addresses the vulnerabilities of large language models in a practical context, revealing how adversarial manipulation can undermine academic integrity. The insights into attack strategies and the introduction of a new vulnerability metric provide valuable knowledge for the development of more secure AI systems, and highlight the necessity of safeguarding AI applications in sensitive domains such as scientific peer review.

📚 Read the Full Paper