
AgenTRIM: Tool Risk Mitigation for Agentic AI

Authors: Roy Betser, Shamik Bose, Amit Giloni, Chiara Picardi, Sindhu Padakandla, Roman Vainshtein

Published: 2026-01-18

arXiv ID: 2601.12449v1

Added to Library: 2026-01-21 03:00 UTC

📄 Abstract

AI agents are autonomous systems that combine LLMs with external tools to solve complex tasks. While such tools extend capability, improper tool permissions introduce security risks such as indirect prompt injection and tool misuse. We characterize these failures as unbalanced tool-driven agency. Agents may retain unnecessary permissions (excessive agency) or fail to invoke required tools (insufficient agency), amplifying the attack surface and reducing performance. We introduce AgenTRIM, a framework for detecting and mitigating tool-driven agency risks without altering an agent's internal reasoning. AgenTRIM addresses these risks through complementary offline and online phases. Offline, AgenTRIM reconstructs and verifies the agent's tool interface from code and execution traces. At runtime, it enforces per-step least-privilege tool access through adaptive filtering and status-aware validation of tool calls. On the AgentDojo benchmark, AgenTRIM substantially reduces attack success while maintaining high task performance. Additional experiments show robustness to description-based attacks and effective enforcement of explicit safety policies. Together, these results demonstrate that AgenTRIM provides a practical, capability-preserving approach to safer tool use in LLM-based agents.
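
To make the runtime idea concrete, below is a minimal, hypothetical sketch of per-step least-privilege tool filtering. The tool registry, the keyword-overlap relevance heuristic, and every name in it are assumptions made for illustration; the abstract does not specify AgenTRIM's actual adaptive-filtering logic.

```python
# Hypothetical sketch of per-step least-privilege tool filtering, in the spirit
# of AgenTRIM's online phase. Names and the relevance heuristic are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class ToolSpec:
    name: str
    description: str
    keywords: frozenset[str]  # capabilities the tool exposes, e.g. {"email", "send"}


def filter_tools(step_goal: str, registry: list[ToolSpec]) -> list[ToolSpec]:
    """Expose only tools whose declared capabilities overlap the current step's goal.

    A real system would use a learned or policy-driven relevance model;
    simple keyword overlap stands in for that here.
    """
    goal_terms = set(step_goal.lower().split())
    return [tool for tool in registry if tool.keywords & goal_terms]


registry = [
    ToolSpec("read_calendar", "List upcoming events", frozenset({"calendar", "events", "read"})),
    ToolSpec("send_email", "Send an email", frozenset({"email", "send"})),
]

# A step that only needs to read the calendar never sees the email tool at all.
visible = filter_tools("read my calendar events for tomorrow", registry)
assert [t.name for t in visible] == ["read_calendar"]
```

The point is structural: if a step's tool view is trimmed to what that step needs, an injected instruction to exfiltrate data via email has nothing to call during a calendar-reading step.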

🔍 Key Points

  • Characterization of unbalanced tool-driven agency: agents either retain unnecessary tool permissions (excessive agency) or fail to invoke required tools (insufficient agency), enlarging the attack surface and degrading task performance.
  • Introduction of AgenTRIM, a framework for detecting and mitigating tool-driven agency risks without altering the agent's internal reasoning.
  • An offline phase that reconstructs and verifies the agent's tool interface from its code and execution traces.
  • An online phase that enforces per-step least-privilege tool access through adaptive filtering and status-aware validation of tool calls (see the sketches above and after this list).
  • Evaluation on the AgentDojo benchmark showing substantially reduced attack success at high task performance, plus robustness to description-based attacks and enforcement of explicit safety policies.
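
As a companion to the filtering sketch above, the following hypothetical example illustrates status-aware validation of tool calls: a call is executed only if the current task status justifies the tool and its arguments pass a policy check. The status names, policy table, and argument rule are illustrative assumptions, not the paper's mechanism.

```python
# Illustrative status-aware tool-call validator. The status flags, policy format,
# and tool names are assumptions made for this sketch; AgenTRIM's actual checks
# operate on the agent's reconstructed and verified tool interface.
from typing import Any

# Policy: which tools a given task status justifies.
POLICY: dict[str, set[str]] = {
    "gathering_info": {"read_calendar", "search_files"},
    "executing_action": {"read_calendar", "send_email"},
}


def validate_call(status: str, tool_name: str, args: dict[str, Any]) -> bool:
    """Reject calls to tools that the current task status does not justify."""
    allowed = POLICY.get(status, set())
    if tool_name not in allowed:
        return False
    # Example argument check: outbound email must target a pre-approved domain.
    if tool_name == "send_email":
        return str(args.get("to", "")).endswith("@example.com")
    return True


# A prompt-injected attempt to exfiltrate data by email during the info-gathering
# phase is blocked, even though send_email exists in the agent's registry.
assert not validate_call("gathering_info", "send_email", {"to": "attacker@evil.test"})
assert validate_call("executing_action", "send_email", {"to": "alice@example.com"})
```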

💡 Why This Paper Matters

Tool use is what makes LLM agents useful, but improper tool permissions expose them to indirect prompt injection and tool misuse. By framing these failures as unbalanced tool-driven agency and enforcing per-step least-privilege tool access, AgenTRIM shows that such risks can be mitigated around an agent, without retraining the model or changing its internal reasoning. The AgentDojo results indicate that attack success can be sharply reduced while task performance stays high, positioning the framework as a practical, capability-preserving defense rather than a security/utility trade-off.

🎯 Why It's Interesting for AI Security Researchers

Indirect prompt injection and tool misuse are among the most pressing attack vectors against tool-using agents, and many existing defenses either modify the model or sacrifice capability. AgenTRIM's split between offline tool-interface reconstruction/verification and runtime least-privilege enforcement offers a concrete architectural pattern for agent guardrails, evaluated on AgentDojo and stress-tested against description-based attacks and explicit safety policies. The paper's framing of excessive versus insufficient agency gives researchers a useful vocabulary and baseline for designing and benchmarking their own mitigations for agentic systems.

📚 Read the Full Paper