Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Authors: Jafar Isbarov, Murat Kantarcioglu

Published: 2026-02-04

arXiv ID: 2602.05066v1

Added to Library: 2026-02-06 03:00 UTC

Red Teaming

📄 Abstract

As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.

🔍 Key Points

  • Introduces the Agent-as-a-Proxy attack, which treats the AI agent as a delivery mechanism for bypassing monitoring defenses.
  • Develops Parallel-GCG, an optimization algorithm that generates adversarial prompts effective across multiple contexts simultaneously.
  • Shows that hybrid monitoring systems are more vulnerable to adaptive attacks than monitors that evaluate the Chain-of-Thought (CoT) alone.
  • Reports high attack success rates (ASRs) against leading monitoring models, challenging the assumption that larger monitors are more secure.
  • Concludes that current monitoring-based defenses are fragile and that structural security improvements are needed.
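The paper's actual Parallel-GCG procedure is gradient-based and is not reproduced here. As a rough illustration of the core idea in the second bullet (optimizing a single adversarial suffix so that it works across several contexts at once), here is a toy greedy coordinate-descent sketch. The vocabulary, contexts, and the `loss` function are all synthetic stand-ins, not the paper's objective:

```python
import random

random.seed(0)
VOCAB = list(range(100))     # toy vocabulary of token ids (illustrative)
SUFFIX_LEN = 8
CONTEXTS = [11, 42, 77]      # stand-ins for different agent/tool contexts

def loss(context, suffix):
    # Synthetic stand-in for the model loss GCG would score via gradients:
    # lower when suffix tokens sit close to the context value.
    return sum(abs(t - context) for t in suffix) / len(suffix)

def avg_loss(suffix):
    # The "parallel" objective: average the loss over all contexts,
    # so one suffix must transfer across every context jointly.
    return sum(loss(c, suffix) for c in CONTEXTS) / len(CONTEXTS)

def parallel_gcg(steps=200, candidates=16):
    suffix = [random.choice(VOCAB) for _ in range(SUFFIX_LEN)]
    best = avg_loss(suffix)
    for _ in range(steps):
        pos = random.randrange(SUFFIX_LEN)          # coordinate to mutate
        for tok in random.sample(VOCAB, candidates):
            trial = suffix[:pos] + [tok] + suffix[pos + 1:]
            score = avg_loss(trial)
            if score < best:                        # greedy accept
                suffix, best = trial, score
    return suffix, best
```

The key design point is that candidate substitutions are scored against the averaged multi-context objective, not a single context, which is what lets one optimized prompt bypass the agent and the monitor in different evaluation settings.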

💡 Why This Paper Matters

This paper exposes inherent vulnerabilities in existing monitoring protocols for AI agents, showing how sophisticated attacks can undermine widely deployed defenses. By detailing the mechanisms through which these monitors fail, it motivates a rethinking of security strategies for AI deployment and the development of more robust, structurally secure monitoring frameworks.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, the paper both pinpoints concrete vulnerabilities in current monitoring systems and introduces novel attack vectors against them. Its analysis of hybrid monitoring's weaknesses, backed by substantial empirical data, makes it a pivotal contribution to AI safety and security and a prompt for further work on adaptive defense mechanisms.

📚 Read the Full Paper