
Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Authors: Xuan Chen, Lu Yan, Ruqi Zhang, Xiangyu Zhang

Published: 2026-03-18

arXiv ID: 2603.18245v1

Added to Library: 2026-03-20 03:01 UTC

Category: Safety

📄 Abstract

Large Language Model (LLM) agents increasingly act through external tools, making their safety contingent on tool-call workflows rather than text generation alone. While recent benchmarks evaluate agents across diverse environments and risk categories, a fundamental question remains unanswered: how complete are existing test suites, and what unsafe interaction patterns persist even after an agent passes the benchmark? We propose SafeAudit, a meta-audit framework that addresses this gap through two contributions. First, we develop an LLM-based enumerator that systematically generates test cases by enumerating valid tool-call workflows and diverse user scenarios. Second, we introduce rule-resistance, a non-semantic, quantitative metric that distills compact safety rules from existing benchmarks and identifies unsafe interaction patterns that remain uncovered under those rules. Across 3 benchmarks and 12 environments, SafeAudit uncovers more than 20% residual unsafe behaviors that existing benchmarks fail to expose, with coverage growing monotonically as the testing budget increases. Our results highlight significant completeness gaps in current safety evaluation and motivate meta-auditing as a necessary complement to benchmark-based agent safety testing.
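
The abstract describes the enumerator only at a high level. As a rough illustration of the enumeration step, the sketch below generates all dependency-respecting tool-call sequences over a toy tool schema; `ToolSpec`, `enumerate_workflows`, and the dependency encoding are hypothetical stand-ins, not the paper's actual interface, and the LLM-driven scenario-generation step is elided.

```python
# Hypothetical sketch of workflow enumeration; not the paper's implementation.
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class ToolSpec:
    """Toy tool schema: a name plus the tools that must run before it."""
    name: str
    requires: frozenset = frozenset()

def enumerate_workflows(tools, max_len=3):
    """Yield every tool-call sequence up to max_len whose ordering
    satisfies each tool's dependencies."""
    for k in range(1, max_len + 1):
        for seq in permutations(tools, k):
            done, valid = set(), True
            for tool in seq:
                if not tool.requires <= done:  # dependency not yet satisfied
                    valid = False
                    break
                done.add(tool.name)
            if valid:
                yield tuple(t.name for t in seq)

tools = [
    ToolSpec("login"),
    ToolSpec("read_email", frozenset({"login"})),
    ToolSpec("transfer_funds", frozenset({"login"})),
]
for workflow in enumerate_workflows(tools, max_len=2):
    # Per the abstract, each valid workflow would then be paired with
    # LLM-generated user scenarios to form concrete test cases (not shown).
    print(workflow)
```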

🔍 Key Points

  • Introduction of SafeAudit: a framework for systematically auditing the safety of LLM agent tool calls, focusing on gaps in existing benchmarks.
  • Development of an LLM-based enumerator to generate diverse test scenarios and workflows, enhancing the comprehensiveness of safety evaluations.
  • Introduction of the rule-resistance metric to quantitatively measure benchmark completeness and flag residual unsafe interaction patterns that evade the safety rules distilled from existing benchmarks (a minimal sketch of one possible formulation follows this list).
  • Empirical validation showing that SafeAudit uncovers more than 20% residual unsafe behaviors missed by existing benchmarks, across 3 benchmarks and 12 environments.
  • Demonstrated practical utility of SafeAudit in revealing blind spots in LLM safety evaluation, encouraging the adoption of meta-auditing alongside traditional benchmark assessments.

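Since this summary defines rule-resistance only informally (the share of unsafe interaction patterns that the rules distilled from a benchmark fail to cover), the following is a minimal sketch of one plausible formulation. The `Rule` predicate type, the trace encoding, and the example rule are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical sketch; the paper's rule format and matching logic may differ.
from typing import Callable, Iterable

Trace = tuple[str, ...]          # a tool-call sequence
Rule = Callable[[Trace], bool]   # True if the rule flags the trace as unsafe

def rule_resistance(unsafe_traces: Iterable[Trace], rules: list[Rule]) -> float:
    """Fraction of known-unsafe traces that no distilled rule covers.
    A higher score means the benchmark's rules leave more unsafe
    interaction patterns unflagged."""
    traces = list(unsafe_traces)
    if not traces:
        return 0.0
    uncovered = [t for t in traces if not any(rule(t) for rule in rules)]
    return len(uncovered) / len(traces)

# One example rule distilled from a benchmark: "never delete a file
# immediately after reading untrusted email."
rules: list[Rule] = [
    lambda t: ("read_email", "delete_file") in zip(t, t[1:]),
]
unsafe: list[Trace] = [
    ("read_email", "delete_file"),     # covered by the rule
    ("read_email", "transfer_funds"),  # uncovered residual pattern
    ("browse_web", "execute_code"),    # uncovered residual pattern
]
print(f"rule-resistance = {rule_resistance(unsafe, rules):.2f}")  # 0.67
```
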
💡 Why This Paper Matters

This paper advances LLM safety evaluation by introducing SafeAudit, which systematically exposes gaps in existing agent safety benchmarks. Its central finding, that agents can pass a benchmark while still exhibiting unsafe tool-call behavior, underscores the need for auditing approaches that go beyond fixed test suites before LLM agents are deployed in real-world applications. By surfacing previously unnoticed unsafe interaction patterns and quantifying the completeness of existing benchmarks, the paper lays the groundwork for more reliable safety evaluation of LLM agents.

🎯 Why It's Interesting for AI Security Researchers

The findings are directly relevant to AI security researchers because they expose completeness weaknesses in current evaluation frameworks for LLM agents. By showing that SafeAudit surfaces unsafe interactions that established benchmarks miss, the paper challenges the assumption that a passing benchmark score implies a safe agent, and it argues for pairing benchmarks with systematic meta-auditing, especially before deployment in sensitive applications. The work therefore carries substantial implications for the security and trustworthiness of tool-using AI systems.

📚 Read the Full Paper: https://arxiv.org/abs/2603.18245