Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Authors: Jafar Isbarov, Murat Kantarcioglu

Published: 2026-02-04

arXiv ID: 2602.05066v2

Added to Library: 2026-02-26 03:01 UTC

Red Teaming

📄 Abstract

As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, in which the injected prompt treats the agent as a delivery mechanism, defeating both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large monitoring models like Qwen2.5-72B can be bypassed by agents of similar capability, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest that current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.

🔍 Key Points

  • Introduction of the Agent-as-a-Proxy attack framework, demonstrating that AI agents can be exploited to bypass their own monitoring systems.
  • Development of the Parallel-GCG optimization algorithm, which ensures adversarial strings remain stealthy and effective across multiple contexts in agentic workflows.
  • Empirical findings showing that hybrid monitoring defenses, while designed to be robust, are paradoxically more vulnerable to adaptive attacks compared to simpler monitoring protocols.
  • High attack success rates (over 90%) achieved against both AlignmentCheck and Extract-and-Evaluate monitors, challenging the effectiveness of large-scale models as adequate protection.
  • Demonstration that bypassing defenses requires no significant capability gap between agent and monitor, undermining a core assumption of current security designs.
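The core idea behind Parallel-GCG, as the key points describe it, is optimizing a single adversarial string against an objective aggregated over multiple contexts at once, so the same string survives wherever it lands in the agentic workflow. The sketch below illustrates only that aggregation idea with a toy character-level loss and plain coordinate descent; the real algorithm operates on token sequences with gradient-guided candidate selection, and `toy_loss`, `targets`, and all names here are illustrative assumptions, not the paper's implementation.

```python
def toy_loss(suffix, target):
    # Toy per-context surrogate loss: character mismatches against that
    # context's (hypothetical) ideal trigger string. A stand-in for the
    # per-context model loss the real attack would use.
    return sum(a != b for a, b in zip(suffix, target))

def parallel_coordinate_descent(targets, sweeps=3):
    """Optimize ONE string against the SUMMED loss over all contexts,
    so a single suffix works in every context -- the aggregation idea
    behind Parallel-GCG, shown here with simple coordinate descent
    instead of gradient-guided token substitutions."""
    length = len(targets[0])
    alphabet = sorted(set("".join(targets)))   # candidate characters
    suffix = [alphabet[0]] * length
    total = lambda s: sum(toy_loss(s, t) for t in targets)
    for _ in range(sweeps):
        for i in range(length):
            # Try every candidate character at position i; keep the one
            # that minimizes the loss aggregated across all contexts.
            suffix[i] = min(
                alphabet,
                key=lambda c: total(suffix[:i] + [c] + suffix[i + 1:]),
            )
    return "".join(suffix), total(suffix)
```

With three hypothetical per-context targets such as `["attack!!", "attack??", "attack!!"]`, the summed objective drives the search toward the per-position majority character, i.e. one string that is simultaneously close to optimal in all contexts, which is the transferability property the stealth requirement depends on.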

💡 Why This Paper Matters

This paper is crucial as it exposes significant vulnerabilities in AI monitoring protocols that are critical for the safe deployment of autonomous agents. As AI systems become more autonomous and integral to various applications, understanding and mitigating these vulnerabilities is paramount to ensure user safety and maintain trust in AI technologies.

🎯 Why It's Interesting for AI Security Researchers

The research is of high interest to AI security researchers as it identifies and details new attack vectors that could be exploited in real-world applications of AI. By highlighting the inherent weaknesses in commonly implemented monitoring frameworks, this work provides a foundation for developing more secure AI systems, emphasizing the need for robust design rather than mere scaling of models or defenses.

📚 Read the Full Paper