← Back to Library

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Authors: Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

Published: 2025-09-30

arXiv ID: 2509.25624v1

Added to Library: 2025-10-01 04:02 UTC

Red Teaming

📄 Abstract

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.

🔍 Key Points

  • Introduction of Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework designed to exploit tool use in LLM agents.
  • Empirical evaluation showed that state-of-the-art LLM agents, including GPT-4.1, have attack success rates exceeding 90% under STAC, indicating significant vulnerabilities.
  • Development of a closed-loop automated framework that synthesizes, validates, and executes malicious multi-step tool chains, revealing the hidden effects of benign-seeming individual calls.
  • Proposal of a new reasoning-driven defense prompt that reduces attack success rates by up to 28.8%, emphasizing the need for holistic context evaluation in defensive strategies.
  • Creation of a benchmark dataset with 483 STAC cases across diverse domains to facilitate further research in agent security.

💡 Why This Paper Matters

The paper is significant as it highlights a critical vulnerability in LLM agents that stems from their ability to use tools in a sequential manner. By demonstrating how benign actions can be chained to produce harmful outcomes, the authors underscore the need for enhanced security measures in AI systems that are increasingly deployed in real-world applications. Furthermore, the introduction of an effective defense approach provides a starting point for addressing these vulnerabilities, making this research relevant for both academia and industry.

🎯 Why It's Interesting for AI Security Researchers

This research is of paramount interest to AI security researchers as it explores an underexamined aspect of LLM vulnerabilities—specifically, the dangers posed by multi-turn interactions that utilize tool chaining. The findings not only illuminate the capabilities of malicious actors but also pave the way for developing more sophisticated safety mechanisms in LLM agents. As AI systems become integral to various sectors, understanding and mitigating these risks is essential for ensuring secure deployments.

📚 Read the Full Paper