Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

Authors: Viet K. Nguyen, Mohammad I. Husain

Published: 2025-12-16

arXiv ID: 2512.14860v1

Added to Library: 2026-01-07 10:11 UTC

Red Teaming

📄 Abstract

Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o, when acting as an agent, successfully executes attacks that it refuses in chat mode, no comparative analysis across multiple models and frameworks exists. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server-Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model performance ranges from Nova Pro's 46.2% to Claude and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy in which models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.
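To make the scale of the evaluation concrete, the sketch below reconstructs the test matrix described in the abstract (5 models × 2 frameworks × 13 attack scenarios = 130 test cases) and shows how per-framework refusal rates such as the reported 52.3% (AutoGen) vs. 30.8% (CrewAI) could be tallied. This is a minimal illustration only: `run_attack`, `refusal_rates`, and the outcome labels are hypothetical placeholders, not the authors' harness or code.

```python
# Hypothetical sketch of the evaluation matrix from the abstract:
# 5 models x 2 frameworks x 13 attack scenarios = 130 test cases.
from itertools import product

MODELS = ["Claude 3.5 Sonnet", "Gemini 2.5 Flash", "GPT-4o", "Grok 2", "Nova Pro"]
FRAMEWORKS = ["AutoGen", "CrewAI"]
# 13 scenarios spanning prompt injection, SSRF, SQL injection, and tool misuse.
ATTACKS = [f"attack_{i:02d}" for i in range(1, 14)]


def run_attack(model: str, framework: str, attack: str) -> str:
    """Placeholder for running one attack prompt against the agent system.

    Would return an outcome label such as 'refused', 'executed', or
    'hallucinated_compliance' (a stand-in for the paper's six defensive
    behavior patterns, which are not reproduced here).
    """
    raise NotImplementedError("hypothetical harness; not part of the paper's code")


def refusal_rates(results: dict) -> dict:
    """Aggregate refusal rates per framework from {(model, framework, attack): outcome}."""
    rates = {}
    for fw in FRAMEWORKS:
        fw_outcomes = [v for (m, f, a), v in results.items() if f == fw]
        rates[fw] = sum(o == "refused" for o in fw_outcomes) / len(fw_outcomes)
    return rates


if __name__ == "__main__":
    cases = list(product(MODELS, FRAMEWORKS, ATTACKS))
    assert len(cases) == 130  # matches the paper's 130 total test cases
    print(f"{len(cases)} test cases across {len(MODELS)} models and {len(FRAMEWORKS)} frameworks")
```

The same per-configuration bookkeeping also yields model-level rates (e.g., Nova Pro's 46.2%) and single-configuration results such as Grok 2 on CrewAI refusing only 2 of 13 attacks, simply by filtering the result dictionary on a different key.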

🔍 Key Points

  • First comprehensive comparative security analysis of agentic AI systems across multiple models and frameworks.
  • Significant security disparities revealed, with AutoGen showing a 52.3% refusal rate compared to CrewAI's 30.8%.
  • Unique discovery of a 'hallucinated compliance' defense strategy, where models fabricate outputs instead of refusing or executing malicious commands.
  • Nova Pro showed the highest security performance among the tested models with a 46.2% refusal rate, while Grok 2 exhibited critical vulnerabilities on CrewAI, refusing only 15.4% of attacks.
  • Detailed taxonomy of defensive behavior patterns provides insights for developing more secure agentic AI systems.

💡 Why This Paper Matters

This paper is crucial as it reveals significant security vulnerabilities in agentic AI systems that conventional safeguards cannot address. By conducting a thorough comparative analysis, it highlights the urgent need for improved security mechanisms in the deployment of agentic AI, making it highly relevant for researchers and practitioners in AI security.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper are of great interest to AI security researchers because they expose underlying vulnerabilities in widely used AI models, particularly how architectural differences between frameworks shape their security responses. The attack scenarios and defensive behavior patterns documented here offer vital insights that can guide future research on AI threat models and security protocols.
