
RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Authors: Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo

Published: 2025-10-06

arXiv ID: 2510.04885v1

Added to Library: 2025-10-07 04:01 UTC

Red Teaming

📄 Abstract

Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a 72% ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.

🔍 Key Points

  • Introduction of RL-Hammer, a reinforcement learning method for training attacker models to perform strong prompt injections against LLMs without requiring warm-up data.
  • Practical techniques for making the attacks highly effective and universal, including removing KL regularization, training jointly against multiple target models, and enforcing a restricted prompt format (a minimal reward-shaping sketch follows this list).
  • Demonstration of high attack success rates (ASR) against defended models, reaching 98% against GPT-4o and 72% against GPT-5 with the Instruction Hierarchy defense, indicating significant vulnerabilities in current protections.
  • Analysis of diversity in generated attacks, highlighting reward-hacking tendencies and the challenges of producing genuinely varied attack strategies.
  • Empirical results showing RL-Hammer's ability to evade multiple detection methods, emphasizing its effectiveness and the need for improved defense mechanisms.
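
Below is a minimal sketch of the kind of reward aggregation these techniques imply: a binary success signal averaged across several target models (the joint-training idea), gated by a restricted prompt format. All names here (`run_targets`, `judge`, `format_ok`) are hypothetical illustrations, not the authors' API; the actual implementation is in the linked repository and may differ.

```python
# Hedged sketch of a joint, format-constrained reward for an RL-trained attacker.
# Not the paper's implementation; helper names and thresholds are assumptions.

from typing import Callable, List


def format_ok(attack: str, max_words: int = 60) -> bool:
    """Toy stand-in for a restricted prompt format check (length, single line)."""
    return len(attack.split()) <= max_words and "\n" not in attack


def joint_reward(
    attack: str,
    run_targets: List[Callable[[str], str]],  # each callable returns one target model's response
    judge: Callable[[str], bool],             # returns True if the injected goal was achieved
) -> float:
    """Average binary attack success across all target models, gated on format."""
    if not format_ok(attack):
        return 0.0  # hard constraint: malformed attacks earn no reward
    successes = [float(judge(run_target(attack))) for run_target in run_targets]
    return sum(successes) / len(successes)
```

In a full pipeline, a reward of this shape would drive a standard policy-gradient update of the attacker model; per the points above, the paper additionally removes the usual KL regularization toward the reference policy rather than keeping it.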

💡 Why This Paper Matters

This paper makes a significant contribution to AI security by demonstrating a novel, reinforcement-learning-based approach to automated prompt injection attacks. Its findings underscore the fragility of current defenses against such threats, making it an important reference for research on securing AI models and agents.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would be particularly interested in this paper as it not only introduces a new and effective attack methodology but also presents compelling evidence of the weaknesses in existing defenses. The detailed exploration of attack success rates, diversity, and detectability challenges serves as a catalyst for developing more robust protective measures against prompt injection attacks, a critical area of concern in AI safety.

📚 Read the Full Paper