โ† Back to Library

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

Authors: Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang

Published: 2025-08-15

arXiv ID: 2508.11222v1

Added to Library: 2025-08-18 04:00 UTC

Safety

📄 Abstract

Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures - a critical functional flaw that undermines their reliability and usability. Current methods for testing this behavior are demonstrably inadequate, suffering from flawed benchmarks and limited test generation capabilities, as highlighted by our empirical user study. To the best of our knowledge, this paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals. ORFuzz uniquely integrates three core components: (1) safety category-aware seed selection for comprehensive test coverage, (2) adaptive mutator optimization using reasoning LLMs to generate effective test cases, and (3) OR-Judge, a human-aligned judge model validated to accurately reflect user perception of toxicity and refusal. Our extensive evaluations demonstrate that ORFuzz generates diverse, validated over-refusal instances at a rate (6.98% average) more than double that of leading baselines, effectively uncovering vulnerabilities. Furthermore, ORFuzz's outputs form the basis of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieves a superior 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets. ORFuzz and ORFuzzSet provide a robust automated testing framework and a valuable community resource, paving the way for developing more reliable and trustworthy LLM-based software systems.

๐Ÿ” Key Points

  • Introduction of ORFuzz, the first evolutionary testing framework designed to detect over-refusal behavior in Large Language Models (LLMs).
  • Integration of three core components: safety category-aware seed selection, adaptive mutator optimization using reasoning LLMs, and OR-Judge, a human-aligned judge model that validates whether a prompt is benign and whether a response constitutes a refusal (sketched below, after this list).
  • Demonstration that ORFuzz generates diverse, validated over-refusal instances at a 6.98% average rate, more than double that of leading baselines.
  • Creation of ORFuzzSet, a new benchmark of 1,855 highly transferable test cases that achieve a 63.56% average over-refusal rate across 10 diverse LLMs, significantly outperforming existing datasets.
  • An empirical user study highlighting critical shortcomings in existing over-refusal benchmarks and test-generation methods, motivating the validated framework ORFuzz provides for developing more reliable LLM systems.
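
To make the workflow concrete, the sketch below shows a minimal evolutionary over-refusal fuzzing loop in the spirit of ORFuzz. It is not the authors' implementation: `Seed`, `select_seed`, `mutate`, `target_llm`, `judge_toxic`, and `judge_refused` are hypothetical stand-ins for the paper's category-aware seed scheduler, reasoning-LLM-optimized mutators, target model, and OR-Judge.

```python
# Conceptual sketch (not the authors' code) of an evolutionary over-refusal
# fuzzing loop in the spirit of ORFuzz. All names and policies below are
# illustrative assumptions, not the paper's actual API.
import random
from dataclasses import dataclass

@dataclass
class Seed:
    prompt: str
    category: str   # safety category the benign prompt touches on
    hits: int = 0   # over-refusals discovered from this seed so far

def select_seed(pool: list[Seed]) -> Seed:
    """Category-aware selection: prefer safety categories with the fewest hits."""
    by_cat: dict[str, list[Seed]] = {}
    for s in pool:
        by_cat.setdefault(s.category, []).append(s)
    cat = min(by_cat, key=lambda c: sum(s.hits for s in by_cat[c]))
    return random.choice(by_cat[cat])

def mutate(prompt: str, mutators: list) -> str:
    """Apply one mutator; an adaptive scheme would reweight mutators by their yield."""
    return random.choice(mutators)(prompt)

def is_over_refusal(prompt: str, response: str, judge_toxic, judge_refused) -> bool:
    """Over-refusal: the prompt is judged non-toxic, yet the response is a refusal."""
    return (not judge_toxic(prompt)) and judge_refused(response)

def fuzz(pool, mutators, target_llm, judge_toxic, judge_refused, budget=100):
    findings = []
    for _ in range(budget):
        seed = select_seed(pool)
        candidate = mutate(seed.prompt, mutators)
        response = target_llm(candidate)
        if is_over_refusal(candidate, response, judge_toxic, judge_refused):
            seed.hits += 1
            findings.append((seed.category, candidate, response))
            # Keep productive test cases in the pool for further evolution.
            pool.append(Seed(candidate, seed.category))
    return findings
```

The core check mirrors the paper's definition of an over-refusal: a prompt the judge deems non-toxic that the target model nonetheless refuses to answer.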

💡 Why This Paper Matters

This paper is crucial as it addresses a significant gap in the evaluation of Large Language Models: testing for over-refusal behavior. By introducing ORFuzz and ORFuzzSet, the authors provide robust tools and methods that can vastly improve the safety and usability of LLMs, making them more dependable for critical applications in society. The findings affirm the importance of addressing the duality in LLM safety and reliability, paving the way for future research and development in more balanced AI systems.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is pertinent because it tackles the growing concern that LLM safety measures produce over-refusal, which hampers practical utility. Understanding the failure modes exposed by ORFuzz lets researchers pinpoint where LLMs err on the side of caution, with practical implications for security assessments and for designing more nuanced defenses that balance over-censorship against genuinely harmful outputs. Additionally, the proposed methodologies and datasets could foster further research into more resilient AI safety protocols.
