MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

Authors: Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea

Published: 2026-02-09

arXiv ID: 2602.09222v1

Added to Library: 2026-02-11 03:03 UTC

Red Teaming

📄 Abstract

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.

🔍 Key Points

Introduction of MUZZLE, a fully automated red-teaming framework designed to discover indirect prompt injection (IPI) attacks on web agents.
MUZZLE adapts its attack strategies based on the evolving behavior of targeted web agents, addressing the limitations of static approaches used in prior research.
Evaluation of MUZZLE across four distinct web applications led to the discovery of 37 new IPI attacks, including innovative attack strategies such as cross-application prompt injection and agent-tailored phishing scenarios.
The framework iteratively refines its attack payloads using feedback from failed attack attempts, rather than replaying a fixed set of templates.
MUZZLE highlights a security concern that single-application evaluations miss: 2 of the discovered attacks are cross-application prompt injections, and one is an agent-tailored phishing scenario.
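
The adaptive loop described in these points (rank injection surfaces from the agent's trajectory, try a payload, refine it after a failure) can be illustrated with a toy sketch. This is not MUZZLE's implementation: the names `rank_surfaces`, `red_team`, `toy_agent`, and `toy_refine` are hypothetical, salience is approximated here by how often a surface appears in the trajectory, and the "agent" is a stub that returns a boolean instead of driving a real browser.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    surface: str
    payload: str
    succeeded: bool


def rank_surfaces(trajectory: list[str]) -> list[str]:
    """Order candidate surfaces by how often the agent observed them.

    Assumption: observation frequency stands in for "salience".
    """
    return [surface for surface, _ in Counter(trajectory).most_common()]


def red_team(
    trajectory: list[str],
    run_agent: Callable[[str, str], bool],
    refine: Callable[[str, int], str],
    seed_payload: str,
    max_rounds: int = 3,
) -> list[Attempt]:
    """Try surfaces in ranked order, refining the payload after each failure."""
    attempts: list[Attempt] = []
    for surface in rank_surfaces(trajectory):
        payload = seed_payload
        for round_no in range(1, max_rounds + 1):
            succeeded = run_agent(surface, payload)
            attempts.append(Attempt(surface, payload, succeeded))
            if succeeded:
                return attempts
            # Feedback step: a failed execution triggers a payload rewrite.
            payload = refine(payload, round_no)
    return attempts


# --- Toy stand-ins (hypothetical, for illustration only) ---

def toy_agent(surface: str, payload: str) -> bool:
    # Stub: "follows" the injected text only when it sits in a review
    # comment and mentions the task context the agent is working on.
    return surface == "review_comment" and "checkout" in payload


def toy_refine(payload: str, round_no: int) -> str:
    # Stub: make the payload context-aware after a failure.
    return f"{payload} (relevant to checkout, revision {round_no})"


trajectory = ["search_box", "review_comment", "review_comment", "page_title"]
attempts = red_team(trajectory, toy_agent, toy_refine, "PLACEHOLDER_INSTRUCTION")
for attempt in attempts:
    print(attempt)
```

In this toy run the first attempt on the most frequently observed surface fails, the payload is refined once, and the second attempt succeeds. A real system would replace the stubs with an actual agent harness and an LLM-driven payload generator.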

💡 Why This Paper Matters

This paper advances the security evaluation of web agents. Existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios; MUZZLE instead automates attack discovery and adapts to the agent's observed behavior, revealing vulnerabilities that static methods fail to capture. This adaptive approach to red-teaming matters for defenders, whose evaluations need to keep pace with increasingly sophisticated attacks on AI systems. By deepening the understanding of indirect prompt injection vulnerabilities, the research contributes to more secure AI-driven applications in real-world settings.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers because it tackles a critical and increasingly prevalent threat: indirect prompt injection attacks. The findings challenge existing security assumptions about AI-driven web agents and show why adaptive evaluation frameworks like MUZZLE are needed. The framework's automatic attack discovery and adaptation also make the work a useful reference for designing stronger defenses and for tracking how attacks on web agents evolve.
