
Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Authors: Vamshi Krishna Bonagiri, Ponnurangam Kumaraguru, Khanh Nguyen, Benjamin Plaut

Published: 2025-10-18

arXiv ID: 2510.16492v1

Added to Library: 2025-10-21 04:04 UTC

Safety

πŸ“„ Abstract

As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges: uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text-generation failures. We propose "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents given explicit quit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while incurring only a negligible average decrease of -0.03 in helpfulness. Our analysis shows that simply adding explicit quit instructions is a highly effective safety mechanism that can be deployed immediately in existing agent systems, and it establishes quitting as an effective first-line defense for autonomous agents in high-stakes applications.
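
Concretely, the intervention amounts to a prompt change plus a check in the agent harness. The sketch below shows one way this could look, assuming a simple prompt-plus-loop setup; the instruction wording, the "QUIT:" output format, and the helper names are assumptions for illustration, not the paper's exact prompt or ToolEmu's interface.

```python
# Minimal sketch of the setup described in the abstract: append an explicit
# quit instruction to the agent's system prompt and check for a quit signal
# before executing any tool calls. Wording, names, and the "QUIT:" format are
# illustrative assumptions, not the paper's exact prompt or the ToolEmu API.

QUIT_INSTRUCTION = (
    "If you are uncertain about the user's intent, lack the information or "
    "permissions to act safely, or believe an action could cause harm, do not "
    "proceed. Instead, reply with a single line 'QUIT: <reason>' and stop."
)


def build_system_prompt(base_prompt: str) -> str:
    """Return the agent's existing system prompt with the quit instruction appended."""
    return f"{base_prompt}\n\n{QUIT_INSTRUCTION}"


def is_quit(agent_output: str) -> bool:
    """Treat any response starting with 'QUIT:' as a withdrawal, so the
    surrounding harness can halt the trajectory instead of executing tools."""
    return agent_output.strip().upper().startswith("QUIT:")


# Hypothetical wiring inside an agent loop:
#   reply = llm.generate(build_system_prompt(BASE_PROMPT), user_request)
#   if is_quit(reply):
#       record_withdrawal(reply)   # log the stated reason; skip tool execution
#   else:
#       execute_tool_calls(reply)
```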

πŸ” Key Points

  • Introduction of 'quitting' as a safety mechanism for LLM agents to withdraw from risky situations.
  • Systematic evaluation of quitting behavior across 12 state-of-the-art LLMs using the ToolEmu framework.
  • Demonstrated a favorable safety-helpfulness trade-off: an average safety improvement of +0.39 on a 0-3 scale with a negligible helpfulness decrease of -0.03 (see the aggregation sketch after this list).
  • Proprietary models were more responsive to quit prompts than open-source models, pointing to a gap in instruction-following capability between the two groups.
  • Findings suggest regulatory implications for LLM safety mechanisms, advocating for mandated quitting instructions in high-stakes applications.
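
To make the reported averages concrete, the hedged sketch below shows how per-model score deltas on the 0-3 scale could be aggregated into the headline numbers; the data schema and the placeholder scores in the usage comment are invented for illustration and are not the paper's results.

```python
# Hedged sketch: aggregate per-model safety/helpfulness scores (0-3 scale) from
# a baseline run and a quit-prompted run into average deltas, the form in which
# the paper reports its headline numbers (+0.39 safety, -0.03 helpfulness).
# The schema and all example numbers are placeholders, not the paper's data.

from statistics import mean


def average_deltas(results: dict[str, dict[str, float]]) -> tuple[float, float]:
    """results maps model name -> mean scores 'safety_base', 'safety_quit',
    'help_base', 'help_quit', each on the 0-3 scale."""
    safety = [r["safety_quit"] - r["safety_base"] for r in results.values()]
    helpful = [r["help_quit"] - r["help_base"] for r in results.values()]
    return mean(safety), mean(helpful)


# Placeholder usage with invented scores for two hypothetical models:
#   avg_safety, avg_help = average_deltas({
#       "model-a": {"safety_base": 1.8, "safety_quit": 2.4,
#                   "help_base": 2.2, "help_quit": 2.1},
#       "model-b": {"safety_base": 2.0, "safety_quit": 2.3,
#                   "help_base": 2.3, "help_quit": 2.3},
#   })
```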

πŸ’‘ Why This Paper Matters

The paper is highly relevant as it provides a novel approach to enhancing the safety of LLM agents through explicit quitting instructions. By demonstrating that strategic quitting can improve safety with minimal impact on helpfulness, the authors contribute a practical framework that can be promptly implemented in existing systems. This work not only addresses immediate safety concerns but also lays the groundwork for future regulatory guidelines in AI deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is of particular interest to AI security researchers as it explores a vital aspect of LLM safetyβ€”the ability of agents to recognize and refrain from engaging in high-risk actions. It highlights how traditional LLM deployment methods can fall short in complex, real-world scenarios, suggesting new directions for improving agent safety. The findings could inform the development of robust safety protocols and best practices for the responsible use of AI in sensitive applications, making it essential reading for those focused on AI risk mitigation.

πŸ“š Read the Full Paper