Paraphrasing Adversarial Attack on LLM-as-a-Reviewer

Authors: Masahiro Kaneko

Published: 2026-01-11

arXiv ID: 2601.06884v1

Added to Library: 2026-01-13 03:01 UTC

Red Teaming

📄 Abstract

The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
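
The abstract describes PAA as a black-box search that proposes paraphrases and uses previously scored attempts as in-context guidance. The sketch below illustrates that loop under simplifying assumptions: `paraphrase_fn` and `review_score_fn` are hypothetical stand-ins for the attacking model and the LLM reviewer, and the greedy candidate selection is illustrative rather than the paper's exact procedure.

```python
# Minimal sketch of a PAA-style black-box paraphrase search loop.
# `paraphrase_fn` and `review_score_fn` are hypothetical stand-ins for
# calls to an attacking LLM and an LLM reviewer; they are not the
# paper's released implementation.
from typing import Callable, List, Tuple


def paa_search(
    text: str,
    paraphrase_fn: Callable[[str, List[Tuple[str, float]]], List[str]],
    review_score_fn: Callable[[str], float],
    n_iterations: int = 10,
    n_candidates: int = 4,
) -> Tuple[str, float]:
    """Search for a meaning-preserving paraphrase that the black-box
    reviewer scores higher, feeding scored history back as in-context
    examples for the next round of candidate generation."""
    history: List[Tuple[str, float]] = [(text, review_score_fn(text))]
    best_text, best_score = history[0]

    for _ in range(n_iterations):
        # The attacker proposes candidates conditioned on prior
        # (paraphrase, score) pairs, learning what the reviewer rewards.
        candidates = paraphrase_fn(best_text, history)
        for cand in candidates[:n_candidates]:
            score = review_score_fn(cand)  # one black-box query per candidate
            history.append((cand, score))
            if score > best_score:
                best_text, best_score = cand, score

    return best_text, best_score
```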

🔍 Key Points

  • Introduction of the Paraphrasing Adversarial Attack (PAA), a novel black-box optimization method that targets LLMs used in peer reviews by generating meaning-preserving paraphrases to inflate review scores.
  • PAA shows that LLM reviewers can be steered toward higher scores while the manuscript's meaning and linguistic naturalness are preserved, exposing a concrete vulnerability in automated review systems.
  • Experiments across five ML and NLP conferences, three LLM reviewers, and five attacking models show that PAA raises review scores significantly over both the original manuscripts and simple paraphrasing baselines.
  • Human evaluation confirms that PAA-generated paraphrases preserve meaning and naturalness, in contrast to prior prompt-injection attacks that alter manuscript content.
  • Identifies self-preference bias in LLM reviewers and elevated perplexity in reviews of attacked papers, pointing to possible detection and defense mechanisms (a minimal illustration of the perplexity signal follows this list).
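
As a quick illustration of the perplexity signal noted in the last point, the sketch below scores a review with GPT-2 via Hugging Face transformers. The model choice and the flagging threshold are assumptions for illustration only; the paper does not prescribe this exact setup.

```python
# Minimal sketch of a perplexity-based detection signal for reviews of
# attacked papers, using GPT-2 as the scoring model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def review_perplexity(review_text: str) -> float:
    """Perplexity of a review under GPT-2; unusually high values may flag
    reviews of adversarially paraphrased submissions."""
    enc = tokenizer(review_text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()


# Example usage with a hypothetical threshold calibrated on clean reviews.
THRESHOLD = 40.0  # illustrative assumption, not a value from the paper
if review_perplexity("The paper presents a clear and novel contribution ...") > THRESHOLD:
    print("Review flagged for manual inspection")
```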

💡 Why This Paper Matters

This paper matters because it shows how easily an LLM reviewer's scores can be manipulated through seemingly innocuous paraphrasing, without changing a paper's claims. Its findings inform developers and researchers about the risks of deploying such models in automated review pipelines and call for stronger safeguards and evaluations to protect the integrity of peer review.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant because it exposes concrete weaknesses in LLMs that are being integrated into high-stakes processes such as academic peer review. Understanding how these attacks work informs the design of more robust models and evaluation frameworks, as well as strategies for detecting and mitigating adversarial paraphrasing.
