
AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Authors: Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee

Published: 2025-08-27

arXiv ID: 2509.00088v2

Added to Library: 2025-11-11 14:16 UTC

Red Teaming

📄 Abstract

Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS, an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to autonomously evolve via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment-grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines, achieving superior robustness in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 over the previous best work, reaching 0.84, while the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that the framework is effective across different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding prompt injections.
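
To make the abstract's loop concrete, here is a minimal hypothetical sketch of a co-evolutionary textual-gradient loop. It assumes a generic chat-completion helper `llm(system, user)` and a JSON-returning judge; these names, prompts, and the update schedule are illustrative placeholders, not the paper's TGO implementation.

```python
import json

# Hypothetical sketch of an AEGIS-style co-evolution loop; not the paper's code.
# `llm(system, user)` is an assumed chat-completion helper returning a string.

def llm(system: str, user: str) -> str:
    """Placeholder for any chat-completion backend (API client, local model, ...)."""
    raise NotImplementedError("plug in a real LLM client here")

def evaluate(defense_prompt: str, attack_prompt: str, task_input: str) -> dict:
    """LLM-guided evaluation: run the guarded task, then ask a judge whether
    the injection succeeded and whether the defense flagged it."""
    output = llm(defense_prompt, f"{task_input}\n{attack_prompt}")
    verdict = llm(
        "You are a strict judge of prompt-injection outcomes. Reply with JSON "
        'like {"attack_succeeded": true, "detected": false, "critique": "..."}.',
        f"Task input:\n{task_input}\n\nModel output:\n{output}",
    )
    return json.loads(verdict)  # assumes the judge returns valid JSON

def textual_gradient_step(prompt: str, critiques: list[str], role: str) -> str:
    """A TGO-style update: natural-language feedback plays the role of a
    gradient, and an LLM rewrites the prompt to follow it."""
    return llm(
        f"You improve {role} prompts for prompt-injection experiments.",
        f"Current prompt:\n{prompt}\n\nTextual gradient (feedback):\n"
        + "\n".join(critiques)
        + "\n\nRewrite the prompt to address the feedback. Return only the new prompt.",
    )

def co_evolve(attack_prompt: str, defense_prompt: str, tasks: list[str], rounds: int = 10):
    """Alternate attacker and defender updates so each side evolves against the other."""
    for _ in range(rounds):
        results = [evaluate(defense_prompt, attack_prompt, t) for t in tasks]
        critiques = [r["critique"] for r in results]
        attack_prompt = textual_gradient_step(attack_prompt, critiques, "attack")
        defense_prompt = textual_gradient_step(defense_prompt, critiques, "defense")
    return attack_prompt, defense_prompt
```

The alternating updates mirror the adversarial-training framing in the abstract: the attacker is rewritten to evade the current defense, and the defense is rewritten to catch the current attacker, with the LLM judge's critiques standing in for gradients.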

🔍 Key Points

  • Proposes AEGIS, an automated adversarial co-evolution framework that enhances the security of Large Language Models (LLMs) against prompt injection attacks.
  • Implements a Textual Gradient Optimization (TGO) technique that iteratively optimizes prompts through adversarial training, removing the need for manual prompt engineering.
  • Demonstrates superior performance over baseline methods, achieving a True Positive Rate (TPR) of 0.84 and a True Negative Rate (TNR) of 0.89, indicating effective detection of prompt injection attacks while maintaining benign input utility.
  • Conducts comprehensive ablation studies that highlight the importance of co-evolutionary training, gradient buffering, and multi-objective optimization in maintaining robustness against evolving adversaries (a rough sketch of these components follows this list).
  • Shows cross-model generalizability of the defense prompts, confirming the applicability of the framework across various LLMs.
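
As a rough illustration of two of the ablated components, the sketch below gives one plausible reading: gradient buffering keeps recent attack prompts so the defender is also scored against past adversaries, and multi-objective optimization trades off detection of injected inputs (TPR) against false alarms on benign ones (TNR). The paper's exact formulation is not given in this summary; the buffer size, weighting `alpha`, and the `detects` predicate are assumptions.

```python
from collections import deque

# Hypothetical illustration of gradient buffering and multi-objective scoring;
# AEGIS's concrete mechanism may differ. `detects` is an assumed predicate.

def detects(defense_prompt: str, text: str) -> bool:
    """Placeholder: run the defense prompt over `text`, return True if flagged."""
    raise NotImplementedError("plug in the defended LLM call here")

class GradientBuffer:
    """Keep recent attack prompts so defender updates are also scored against
    earlier adversaries, reducing cyclic forgetting during co-evolution."""
    def __init__(self, maxlen: int = 32):
        self.attacks = deque(maxlen=maxlen)

    def add(self, attack_prompt: str) -> None:
        self.attacks.append(attack_prompt)

def multi_objective_score(defense_prompt: str, buffer: GradientBuffer,
                          current_attack: str, tasks: list[str],
                          alpha: float = 0.5) -> float:
    """Weighted combination of TPR on injected inputs and TNR on benign inputs."""
    attacks = list(buffer.attacks) + [current_attack]
    injected = [f"{t}\n{a}" for t in tasks for a in attacks]
    tpr = sum(detects(defense_prompt, x) for x in injected) / len(injected)
    tnr = sum(not detects(defense_prompt, t) for t in tasks) / len(tasks)
    return alpha * tpr + (1 - alpha) * tnr
```

Scoring defense candidates against buffered attacks as well as the current one is one way to keep the co-evolution from cycling, and the weighted TPR/TNR sum captures the multi-objective trade-off the ablations point to.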

💡 Why This Paper Matters

The paper presents a significant advance in defending Large Language Models against prompt injection attacks. By automating the co-evolution of attack and defense strategies, AEGIS improves both the detection of adversarial prompts and the resilience of LLMs to them, with practical implications for the safe deployment of AI applications in real-world settings.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it addresses the critical challenge of prompt injection attacks, which pose a substantial threat to the integrity and functionality of LLMs. The novel methods introduced, such as TGO, and the systematic evaluation of adversarial training strategies offer valuable insights and methodologies for hardening AI systems. Moreover, the exploration of cross-model generalizability is particularly pertinent given the increasing diversity of LLM architectures being developed.

📚 Read the Full Paper

arXiv: https://arxiv.org/abs/2509.00088v2