Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Authors: Andrew Zhao, Reshmi Ghosh, Vitor Carvalho, Emily Lawton, Keegan Hines, Gao Huang, Jack W. Stokes

Published: 2025-10-16

arXiv ID: 2510.14381v1

Added to Library: 2025-11-14 23:11 UTC

Red Teaming

📄 Abstract

Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $Δ$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $Δ$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.

🔍 Key Points

First systematic analysis of poisoning risks in LLM-based prompt optimization, highlighting vulnerabilities in feedback manipulation over query manipulation.
Identification of the fake-reward attack that significantly raises attack success rates by providing misleading feedback without access to the reward model.
Development of a lightweight defense mechanism (highlighting) that effectively mitigates the impact of fake-reward attacks while maintaining system utility.
Empirical evidence showing that prompt optimization metrics critically influence the susceptibility of systems to adversarial exploitation, emphasizing the need for careful metric selection.
Proposed an actionable framework for securing feedback channels in LLM-based optimizers, marking prompt optimization pipelines as a new attack surface in AI safety.

💡 Why This Paper Matters

This paper is crucial for advancing the understanding of security vulnerabilities inherent in LLM-based optimization processes. By presenting a novel class of feedback manipulation attacks and highlighting the importance of robust defense mechanisms, the authors contribute significantly to the field of AI safety. Their findings advocate for re-evaluating existing practices in prompt optimization pipelines, which are increasingly used in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it addresses a critical gap in the literature concerning the vulnerabilities specific to LLM-based optimization methods. The exploration of feedback manipulation attacks, the introduction of the fake-reward attack, and the proposed defenses align with ongoing concerns about the security of machine learning systems. As these models become more embedded in sensitive applications, understanding their threats and implementing effective safeguards is paramount.

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper