Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Authors: Xin Chen, Jie Zhang, Florian Tramer

Published: 2026-02-05

arXiv ID: 2602.05746v1

Added to Library: 2026-02-06 03:04 UTC

Red Teaming

📄 Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents, yet effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT-5 Nano, Claude 3.5 Sonnet, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
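The abstract's core objective, jointly optimizing attack success and utility preservation on benign tasks, can be sketched as a scalar reward. This is an illustrative assumption, not the authors' implementation: the function name `joint_reward` and the linear weighting `lam` are hypothetical, and the paper's actual trade-off may be shaped differently.

```python
def joint_reward(attack_success: bool, benign_utility: float, lam: float = 0.5) -> float:
    """Hypothetical sketch of AutoInject's joint objective.

    attack_success: did the adversarial suffix trigger the injected goal?
    benign_utility: fraction of benign tasks still completed (0.0 to 1.0),
        penalizing suffixes that break the agent on legitimate inputs.
    lam: assumed trade-off weight between the two terms.
    """
    return (1.0 - lam) * float(attack_success) + lam * benign_utility
```

A suffix that succeeds while leaving benign behavior intact scores highest; one that succeeds only by degrading the agent on normal tasks is penalized by the second term.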

🔍 Key Points

  • Introduction of AutoInject, a reinforcement learning framework that automates prompt injection attacks, enhancing scalability and adaptability over previous manual methods.
  • Utilization of a comparison-based feedback mechanism to provide dense reward signals, effectively addressing the problem of reward sparsity in reinforcement learning for adversarial attacks.
  • Empirical evaluations demonstrate superior performance of AutoInject, achieving higher attack success rates (ASR) while preserving utility on benign tasks compared to both template-based and optimization-based baselines.
  • Discovery of transferable adversarial suffixes that compromise multiple models, demonstrating generalization to unseen tasks and models.
  • Identification of behavioral patterns in LLMs that enhance vulnerability to prompt injections, emphasizing the need for robust defensive mechanisms.

💡 Why This Paper Matters

The paper introduces a novel automated framework for generating prompt injection attacks through reinforcement learning, significantly improving upon existing manual and optimization-based strategies. The findings highlight critical vulnerabilities in large language models and offer valuable insights into potential defenses, making it essential reading for those involved in AI security research.

🎯 Why It's Interesting for AI Security Researchers

Given the increasing deployment of large language models in sensitive applications, understanding and mitigating their vulnerabilities is paramount. This paper presents a systematic approach to automated attacks, providing AI security researchers with tools and insights to enhance model robustness and inform protective strategies against evolving adversarial threats.

📚 Read the Full Paper