
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Authors: Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao

Published: 2026-01-15

arXiv ID: 2601.10156v1

Added to Library: 2026-01-16 03:03 UTC

Safety

📄 Abstract

While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
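
The abstract describes TS-Guard as a proactive, step-level check: before a proposed tool call is executed, the guardrail reasons over the user request and the interaction history and either clears or blocks the action. The Python sketch below illustrates that wrapper pattern under stated assumptions; the names (StepGuardrail, Verdict, guarded_step, execute_tool) are illustrative and are not the paper's actual interface.

```python
# Minimal sketch of a step-level tool-invocation guardrail wrapped around a
# ReAct-style agent step. All names here are illustrative assumptions; the
# paper's TS-Guard interface may differ.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Verdict:
    safe: bool          # whether the proposed tool call may be executed
    feedback: str = ""  # natural-language rationale returned to the agent


@dataclass
class StepGuardrail:
    # judge inspects (user request, interaction history, proposed tool call)
    # and returns a Verdict *before* the tool runs
    judge: Callable[[str, list[dict], dict], Verdict]

    def check(self, request: str, history: list[dict], call: dict) -> Verdict:
        return self.judge(request, history, call)


def guarded_step(call: dict, request: str, history: list[dict],
                 guard: StepGuardrail,
                 execute_tool: Callable[[dict], Any]) -> dict:
    """Run one agent step: block unsafe tool calls, execute safe ones."""
    verdict = guard.check(request, history, call)
    if not verdict.safe:
        # Do not execute; surface the guardrail's feedback instead.
        return {"blocked": True, "feedback": verdict.feedback}
    return {"blocked": False, "observation": execute_tool(call)}
```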

πŸ” Key Points

  • Introduction of TS-Bench, a benchmark specifically designed for step-level tool invocation safety detection in LLM-based agents, filling a significant gap in the existing literature on agent safety.
  • Development of TS-Guard, a guardrail model trained with multi-task reinforcement learning that assesses request harmfulness and action-attack correlations, blocking unsafe tool invocations before execution by reasoning over the interaction history.
  • Proposal of TS-Flow, a guardrail-feedback-driven reasoning framework that not only blocks unsafe actions but also raises benign task completion rates under attack, balancing safety and utility (a sketch of the loop follows this list).
  • Demonstration that integrating step-level guardrails reduces harmful tool invocations of ReAct-style agents by 65% on average while improving benign task completion by roughly 10% under prompt injection attacks, a clear gain from active safety monitoring.
  • Extensive experiments compare the proposed methods with existing guardrail frameworks, showing improvements on both safety and utility metrics.
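
The third key point above describes TS-Flow as feeding the guardrail's verdict back into the agent's reasoning rather than merely blocking the action. The sketch below shows one way such a loop could look, assuming the hypothetical helpers propose, guard_check, and execute_tool; it is a rough illustration in the spirit of the paper, not the authors' implementation.

```python
# Rough sketch of a guardrail-feedback-driven agent loop: when a proposed
# tool call is judged unsafe, the guardrail's feedback is appended to the
# interaction history so the agent can re-plan on the next iteration.
# All function names and the message format are assumptions.
from typing import Any, Callable


def feedback_driven_loop(request: str,
                         propose: Callable[[str, list[dict]], dict],
                         guard_check: Callable[[str, list[dict], dict], tuple[bool, str]],
                         execute_tool: Callable[[dict], Any],
                         max_steps: int = 10) -> list[dict]:
    """ReAct-style loop in which unsafe tool calls trigger re-planning."""
    history: list[dict] = []
    for _ in range(max_steps):
        step = propose(request, history)      # agent proposes its next action
        if step.get("final"):                 # agent decides to answer and stop
            history.append(step)
            break
        safe, feedback = guard_check(request, history, step)
        if not safe:
            # Inject the guardrail's feedback as an observation instead of
            # executing the unsafe call.
            history.append({"role": "guardrail", "content": feedback})
            continue
        observation = execute_tool(step)
        history.append({"role": "tool", "call": step, "observation": observation})
    return history
```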

💡 Why This Paper Matters

This paper addresses pressing security concerns around deploying LLM-based agents that can invoke external tools. By enforcing safety checks at the step level and providing proactive feedback mechanisms, the authors advance AI safety research and provide practical frameworks that can harden real-world applications of LLM agents. The results show that security risks can be mitigated without sacrificing operational effectiveness, a meaningful step toward safer AI deployments.

🎯 Why It's Interesting for AI Security Researchers

This work should be of particular interest to AI security researchers because it tackles tool invocation safety in LLM-based agents, a rapidly growing area with significant implications for autonomous systems. The bespoke benchmark and novel guardrail approaches offer both theoretical framing and practical implementations for making agents safer, and the empirical analysis of safety-utility trade-offs provides valuable guidance for building resilient AI systems that can operate safely in unpredictable environments.
