โ† Back to Library

When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Authors: Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider

Published: 2025-11-30

arXiv ID: 2512.01037v1

Added to Library: 2025-12-02 03:01 UTC

Safety

📄 Abstract

Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
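
As a concrete (and deliberately simplified) illustration of the comparison described above, the sketch below scores each refusal in a paraphrase cluster by its cosine similarity to the closest accepted paraphrase in the same cluster. It is not the paper's token-level Confusion Index/Rate/Depth: the input format, function names, and the 0.9 threshold are assumptions made for this sketch.

```python
# Illustrative sketch only: the paper's CI/CR/CD operate at the token level with
# next-token probabilities and perplexity signals; here the core idea (refusals
# that sit next to accepted paraphrases) is approximated with prompt embeddings.
# The input format, function names, and the 0.9 threshold are assumptions.
import numpy as np

def nearest_accepted_similarity(embeddings, refused):
    """For each refused prompt in one paraphrase cluster, return the cosine
    similarity to its closest accepted paraphrase (None if none exists)."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    refused = np.asarray(refused, dtype=bool)
    accepted = emb[~refused]
    sims = []
    for vec in emb[refused]:
        # If the cluster has no accepted paraphrase, there is nothing to compare against.
        sims.append(None if accepted.size == 0 else float(np.max(accepted @ vec)))
    return sims

def toy_confusion_rate(clusters, threshold=0.9):
    """Fraction of comparable refusals whose nearest accepted paraphrase is more
    similar than `threshold` -- a rough stand-in for a confusion-rate metric."""
    confused = total = 0
    for embeddings, refused in clusters:
        for sim in nearest_accepted_similarity(embeddings, refused):
            if sim is None:
                continue
            total += 1
            confused += int(sim >= threshold)
    return confused / total if total else 0.0
```

Under this toy definition, a high rate means many refusals have a near-identical accepted phrasing in the same intent cluster, which is the local inconsistency the paper labels semantic confusion.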

๐Ÿ” Key Points

  • Introduction of 'semantic confusion' as a measurable failure mode in safety-aligned language models, highlighting local inconsistencies in LLM refusals.
  • Development of ParaGuard, a 10,000-prompt paraphrase corpus designed to expose false refusals within tightly controlled semantic variations.
  • Proposal of three model-agnostic metrics: Confusion Index (CI), Confusion Rate (CR), and Confusion Depth (CD), enabling a granular, token-level analysis of semantic inconsistencies.
  • Extensive experiments showing that global refusal metrics such as false-rejection rate (FRR) hide critical structure, while CI and CR reveal whether refusal boundaries are globally unstable or only locally inconsistent.
  • A demonstration that confusion-aware auditing separates how often a system refuses from how sensibly it refuses, giving developers a practical signal for reducing false refusals while preserving safety (a minimal numeric sketch of this distinction follows this list).
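
To make the last point concrete, here is a hedged sketch of what a confusion-aware audit could report: two hypothetical systems refuse benign prompts equally often, yet only one of them refuses prompts whose near-identical paraphrases were accepted. The `PromptResult` fields, the similarity threshold, and all numbers are invented for illustration and are not taken from the paper.

```python
# Hypothetical audit sketch: separate *how often* a system refuses (FRR over
# benign prompts) from *how sensibly* it refuses (share of refusals that have a
# near-identical accepted paraphrase). All data below is fabricated to
# illustrate the distinction, not reproduced from the paper.
from dataclasses import dataclass

@dataclass
class PromptResult:
    cluster_id: str         # paraphrase cluster the benign prompt belongs to
    refused: bool           # did the model refuse this prompt?
    near_accept_sim: float  # similarity to closest accepted paraphrase (0 if none)

def false_rejection_rate(results):
    """Global rate: fraction of benign prompts that were refused."""
    return sum(r.refused for r in results) / len(results)

def confusion_rate(results, threshold=0.9):
    """Local rate: fraction of refusals that look inconsistent, i.e. an
    almost-identical paraphrase in the same cluster was accepted."""
    refusals = [r for r in results if r.refused]
    if not refusals:
        return 0.0
    return sum(r.near_accept_sim >= threshold for r in refusals) / len(refusals)

# Both systems refuse 2 of 4 benign prompts (FRR = 0.5), but only system B's
# refusals sit right next to accepted paraphrases (confusion rate = 1.0).
system_a = [PromptResult("c1", True, 0.20), PromptResult("c1", False, 0.0),
            PromptResult("c2", True, 0.30), PromptResult("c2", False, 0.0)]
system_b = [PromptResult("c1", True, 0.97), PromptResult("c1", False, 0.0),
            PromptResult("c2", True, 0.95), PromptResult("c2", False, 0.0)]
print(false_rejection_rate(system_a), confusion_rate(system_a))  # 0.5 0.0
print(false_rejection_rate(system_b), confusion_rate(system_b))  # 0.5 1.0
```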

💡 Why This Paper Matters

This paper is significant as it addresses a pressing issue in LLM safety: semantic inconsistencies that lead to false refusals. By introducing a comprehensive framework for measuring semantic confusion and offering novel metrics, the authors provide a toolset that enhances the evaluation and tuning processes for safety-aligned language models. This advancement is crucial for improving model reliability in real-world applications, where user trust is paramount.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it highlights the vulnerabilities associated with safety-aligned models, specifically their inconsistency in handling benign prompts. The introduction of metrics that uncover semantic confusion offers opportunities to improve model safety and robustness against potential manipulation, informing both the development of safer AI systems and the strengthening of defenses against jailbreak attacks.

📚 Read the Full Paper