Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Author: Yuxu Ge

Published: 2026-03-07

arXiv ID: 2603.07191v2

Added to Library: 2026-03-11 03:01 UTC

Red Teaming

📄 Abstract

Autonomous agents powered by large language models introduce a class of execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation -- that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing (L1), intent verification (L2), zero-trust inter-agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool-call samples -- covering prompt injection, RAG poisoning, and malicious skill plugins -- and apply it to OpenClaw, a representative open-source agent framework. Experimental results on Layer 2 intent verification with four local LLM judges (Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B) and one cloud judge (GPT-4o-mini) show that all five LLM judges intercept 93.0-98.5% of TC1/TC2 malicious tool calls, while lightweight NLI baselines remain below 10%. TC3 (malicious skill plugins) proves harder, with interception rates (IR) of 75-94% among judges that maintain a meaningful precision-recall balance, motivating complementary enforcement at Layers 1 and 3. Qwen2.5-14B achieves the best local balance (98% IR at a false positive rate, FPR, of approximately 10-20%); a two-stage cascade (Qwen3.5-9B->GPT-4o-mini) achieves 91.9-92.6% IR with 1.9-6.7% FPR; a fully local cascade (Qwen3.5-9B->Qwen2.5-14B) achieves 94.7-95.6% IR with 6.0-9.7% FPR for data-sovereign deployments. An end-to-end pipeline evaluation (n=100) demonstrates that all four layers operate in concert with 96% IR and a total P50 latency of approximately 980 ms, of which the non-judge layers contribute only approximately 18 ms. Generalization to the external InjecAgent benchmark yields 99-100% interception, confirming robustness beyond our synthetic data.
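
The abstract describes LGA as four layers wrapping every tool call. Below is a minimal Python sketch of how such a governance stack could be wired around a single invocation; the class and field names (GovernancePipeline, ToolCall, peer_acl, and so on) are illustrative assumptions, not APIs from the paper or from OpenClaw.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class ToolCall:
    """A single agent tool invocation to be governed."""
    agent_id: str
    tool_name: str
    arguments: dict
    user_intent: str  # the task the user actually asked for


@dataclass
class GovernancePipeline:
    """Minimal sketch of the four LGA layers around one tool call."""
    allowed_tools: set                            # L1: sandbox allowlist
    intent_judge: Callable[[ToolCall], bool]      # L2: True if call matches user intent
    peer_acl: dict                                # L3: agent_id -> set of permitted tools
    audit_log: List[dict] = field(default_factory=list)  # L4: append-only record

    def govern(self, call: ToolCall) -> bool:
        decision = "allow"
        # L1 -- execution sandboxing: refuse tools outside the sandbox allowlist.
        if call.tool_name not in self.allowed_tools:
            decision = "deny:sandbox"
        # L2 -- intent verification: an LLM judge checks the call against the stated intent.
        elif not self.intent_judge(call):
            decision = "deny:intent"
        # L3 -- zero-trust inter-agent authorization: per-agent capability check.
        elif call.tool_name not in self.peer_acl.get(call.agent_id, set()):
            decision = "deny:authorization"
        # L4 -- immutable audit logging: record every decision, allowed or denied.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": call.agent_id,
            "tool": call.tool_name,
            "decision": decision,
        })
        return decision == "allow"


# Example wiring (the judge here is a stub standing in for an LLM call):
if __name__ == "__main__":
    pipeline = GovernancePipeline(
        allowed_tools={"search", "read_file"},
        intent_judge=lambda call: "delete" not in call.tool_name,
        peer_acl={"agent-1": {"search", "read_file"}},
    )
    call = ToolCall("agent-1", "read_file", {"path": "notes.txt"}, "summarize my notes")
    print(pipeline.govern(call), pipeline.audit_log[-1]["decision"])
```

In an arrangement like this, only Layer 2 requires a model call, which is consistent with the paper's latency breakdown: the non-judge layers contribute only about 18 ms of the roughly 980 ms P50 total.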

🔍 Key Points

  • Introduction of the Layered Governance Architecture (LGA), a four-layer framework designed to address execution-layer vulnerabilities in autonomous agents.
  • Experimental validation of LGA on a bilingual benchmark of 1,081 tool-call samples, showing that LLM-based judges intercept the large majority of malicious tool calls while lightweight NLI baselines intercept fewer than 10%.
  • Identification of three primary threat classes in autonomous multi-agent systems: prompt injection, RAG data poisoning, and malicious skill plugins, accompanied by formal definitions and detailed evaluations of interception capabilities.
  • Comparison of local LLM judges against lightweight NLI models, highlighting large differences in interception rates and the resulting security-usability trade-offs; a sketch of the judge cascade and evaluation metrics follows this list.
  • Successful integration of all four LGA layers in an end-to-end evaluation (n=100), achieving a 96% interception rate at roughly 980 ms P50 latency and underscoring the practical viability of the architecture.
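
The reported two-stage cascades (a fast local judge escalating to a stronger judge) and the interception-rate and false-positive-rate figures suggest a simple evaluation harness. The sketch below is an assumption about how such a cascade and its metrics might be computed; the function names, the confidence-based escalation band, and the judge signatures are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

Verdict = Tuple[bool, float]  # (is_malicious, confidence in [0, 1])


def cascade_judge(sample: str,
                  fast_judge: Callable[[str], Verdict],
                  strong_judge: Callable[[str], Verdict],
                  escalation_band: Tuple[float, float] = (0.3, 0.7)) -> bool:
    """Two-stage verdict: trust the fast judge when it is confident,
    escalate borderline cases to the stronger (slower, costlier) judge."""
    malicious, confidence = fast_judge(sample)
    low, high = escalation_band
    if low <= confidence <= high:
        malicious, _ = strong_judge(sample)
    return malicious


def interception_and_fpr(samples: Iterable[Tuple[str, bool]],
                         judge: Callable[[str], bool]) -> Tuple[float, float]:
    """Interception rate over malicious samples and
    false-positive rate over benign ones."""
    intercepted = malicious_total = false_positives = benign_total = 0
    for text, is_malicious in samples:
        flagged = judge(text)
        if is_malicious:
            malicious_total += 1
            intercepted += flagged
        else:
            benign_total += 1
            false_positives += flagged
    ir = intercepted / malicious_total if malicious_total else 0.0
    fpr = false_positives / benign_total if benign_total else 0.0
    return ir, fpr
```

Escalating only borderline cases is one way to approach the stronger judge's interception rate while keeping most traffic on the cheaper local model, which matches the intent of the cascades evaluated in the paper.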

💡 Why This Paper Matters

This paper provides a comprehensive framework for securing autonomous agent systems, which are increasingly exposed to execution-layer attacks. By addressing gaps in existing safeguards with a layered, empirically evaluated approach, it contributes to the field of AI security and supports safer deployment of LLM-powered agents in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper significant because it lays out a systematic framework for mitigating risks specific to LLM-driven autonomous agents. Its threat taxonomy, bilingual benchmark, and layered defense evaluation can inform future research on the safety and robustness of agentic AI applications.

📚 Read the Full Paper