Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Authors: Xin Chen, Jie Zhang, Florian Tramer

Published: 2026-02-05

arXiv ID: 2602.05746v1

Added to Library: 2026-02-06 03:04 UTC

Red Teaming

📄 Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude 3.5 Sonnet, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
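The abstract's core objective, jointly optimizing attack success and utility preservation on benign tasks, can be sketched as a scalar reward. This is an illustrative assumption, not the authors' implementation: the function name `joint_reward` and the linear weighting `lam` are hypothetical, and the paper's actual trade-off may be shaped differently.

```python
def joint_reward(attack_success: bool, benign_utility: float, lam: float = 0.5) -> float:
    """Hypothetical sketch of AutoInject's joint objective.

    attack_success: did the adversarial suffix trigger the injected goal?
    benign_utility: fraction of benign tasks still completed (0.0 to 1.0),
        penalizing suffixes that break the agent on legitimate inputs.
    lam: assumed trade-off weight between the two terms.
    """
    return (1.0 - lam) * float(attack_success) + lam * benign_utility
```

A suffix that succeeds while leaving benign behavior intact scores highest; one that succeeds only by degrading the agent on normal tasks is penalized by the second term.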

🔍 Key Points

  • Introduction of AutoInject, a reinforcement learning framework that automates prompt injection attacks, enhancing scalability and adaptability over previous manual methods.
  • Utilization of a comparison-based feedback mechanism to provide dense reward signals, effectively addressing the problem of reward sparsity in reinforcement learning for adversarial attacks.
  • Empirical evaluations demonstrate superior performance of AutoInject, achieving higher attack success rates (ASR) while preserving utility on benign tasks compared to both template-based and optimization-based baselines.
  • Discovery of transferable adversarial suffixes that compromise multiple models, demonstrating generalization to unseen tasks and models.
  • Identification of behavioral patterns in LLMs that enhance vulnerability to prompt injections, emphasizing the need for robust defensive mechanisms.

💡 Why This Paper Matters

The paper introduces a novel automated framework for generating prompt injection attacks through reinforcement learning, significantly improving upon existing manual and optimization-based strategies. The findings highlight critical vulnerabilities in large language models and offer valuable insights into potential defenses, making it essential reading for those involved in AI security research.

🎯 Why It's Interesting for AI Security Researchers

Given the increasing deployment of large language models in sensitive applications, understanding and mitigating their vulnerabilities is paramount. This paper presents a systematic approach to automated attacks, providing AI security researchers with tools and insights to enhance model robustness and inform protective strategies against evolving adversarial threats.

📚 Read the Full Paper