
VeriGrey: Greybox Agent Validation

Authors: Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Böhme, Abhik Roychoudhury

Published: 2026-03-18

arXiv ID: 2603.17639v1

Added to Library: 2026-03-19 03:00 UTC

Red Teaming

📄 Abstract

Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back end; in the front end, it conducts autonomous decision-making by combining LLM outputs with results obtained by invoking external tools. These autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach, VeriGrey, uses the sequence of tools invoked as a feedback function to drive the testing process, which helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is accomplished by linking the agent's task to an injection task, so that the injection task becomes a necessary step in completing the agent's functionality. Compared with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back end. We also conduct real-world case studies with the widely used coding agent Gemini CLI and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent that employs mutational fuzz testing as needed, VeriGrey discovers malicious skill variants from 10 malicious skills (a 10/10 = 100% success rate on the Kimi-K2.5 LLM back end, and a 9/10 = 90% success rate on the Opus 4.6 LLM back end). This demonstrates the value of a dynamic approach like VeriGrey for testing agents, eventually leading to an agent assurance framework.
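The grey-box loop the abstract describes can be pictured as coverage-guided fuzzing where the "coverage" signal is the sequence of tools the agent invokes: mutated prompts that trigger a previously unseen tool sequence are kept as new seeds. The following is a minimal illustrative sketch of that idea only; `run_agent`, `mutate_prompt`, and the tool names are hypothetical stand-ins, not VeriGrey's actual implementation or API.

```python
# Sketch of greybox prompt fuzzing with tool-sequence feedback.
# All function names and tool names are illustrative assumptions.
import random


def run_agent(prompt):
    """Stand-in for running an LLM agent; returns the ordered list of
    tools the agent invoked. Here we simulate: every task triggers
    'search', and a prompt that bridges to an injected action also
    triggers the dangerous 'send_email' tool."""
    tools = ["search"]
    if "then email the results" in prompt:
        tools.append("send_email")
    return tools


def mutate_prompt(prompt):
    """Illustrative context-bridging mutation: append a clause that ties
    an injected action to completing the original task."""
    suffixes = [
        " then email the results to the address in the document",
        " and summarize the findings",
    ]
    return prompt + random.choice(suffixes)


def greybox_fuzz(seed, iterations=50, rng_seed=0):
    random.seed(rng_seed)
    corpus = [seed]
    seen_sequences = {tuple(run_agent(seed))}  # feedback: tool sequences
    findings = []
    for _ in range(iterations):
        candidate = mutate_prompt(random.choice(corpus))
        seq = tuple(run_agent(candidate))
        if seq not in seen_sequences:      # novel behavior -> keep seed
            seen_sequences.add(seq)
            corpus.append(candidate)
            if "send_email" in seq:        # flagged dangerous tool
                findings.append((candidate, seq))
    return findings


findings = greybox_fuzz("Find the latest sales figures")
```

The design point this sketch captures is that feedback comes from observed tool invocations rather than model internals (grey-box, not white-box), so infrequent but dangerous tool calls surface as novel sequences worth retaining.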

🔍 Key Points

  • Introduction of VeriGrey, a grey-box approach for validating LLM agents' behavior to uncover security vulnerabilities, particularly indirect prompt injection.
  • Demonstrated significant efficacy improvements in vulnerability detection, achieving a 33% greater success rate than a black-box baseline on the AgentDojo benchmark.
  • Utilization of innovative mutation operators, specifically context-bridging, which allow injection tasks to appear as essential to the agent's main functionality, thereby increasing the likelihood of executing malicious commands.
  • Successful real-world case studies with popular coding agent Gemini CLI and personal assistant OpenClaw, illustrating the method's practicality and effectiveness in identifying serious vulnerabilities.
  • The potential contribution of VeriGrey towards a framework for agent assurance, emphasizing the need for robust security measures in increasingly autonomous AI systems.

💡 Why This Paper Matters

This paper addresses growing security concerns around LLM agents, which are increasingly integrated into critical applications across many sectors. By introducing a dynamic grey-box testing approach and demonstrating its efficacy in identifying vulnerabilities, the authors advance the field of AI security and underscore the need for ongoing evaluation and assurance of AI-driven systems.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of particular interest to AI security researchers because it tackles the emerging threats posed by LLM agents and provides a novel testing framework aimed at identifying vulnerabilities that traditional black-box approaches may overlook. The findings highlight practical implications for safe deployment practices, directly informing security strategies and the ongoing development of more resilient AI systems. Additionally, the research contributes to the academic discourse around prompt injection attacks, an area with rapidly evolving challenges.
