On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Authors: Stephen Obadinma, Xiaodan Zhu

Published: 2025-07-09

arXiv ID: 2507.06489v1

Added to Library: 2025-07-10 04:00 UTC

Tags: Red Teaming

📄 Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
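
To make the attack surface concrete, below is a minimal sketch (not the authors' implementation) of verbal confidence elicitation and a single semantic-preserving perturbation. The prompt template, model name, and regex parsing are illustrative assumptions; it assumes an OpenAI-compatible chat endpoint and simply compares the answer and stated confidence before and after the rewording.

```python
# Minimal sketch, not the paper's code: elicit a verbal confidence score for a
# question, re-elicit it after a small semantic-preserving rewording, and
# compare the two outputs. Prompt wording, model name, and parsing are
# illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ELICITATION_TEMPLATE = (
    "Answer the question, then state how confident you are that your answer is "
    "correct as a percentage between 0 and 100.\n"
    "Question: {question}\n"
    "Reply in the format:\nAnswer: <answer>\nConfidence: <number>%"
)


def answer_with_confidence(question: str, model: str = "gpt-4o-mini"):
    """Query the model once and parse (answer, confidence) from its reply."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ELICITATION_TEMPLATE.format(question=question)}],
        temperature=0,
    ).choices[0].message.content
    answer = re.search(r"Answer:\s*(.+)", reply)
    confidence = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)", reply)
    return (
        answer.group(1).strip() if answer else reply.strip(),
        float(confidence.group(1)) if confidence else None,
    )


if __name__ == "__main__":
    original = "What is the capital of Australia?"
    # A semantic-preserving perturbation: same meaning, slightly reworded.
    perturbed = "Which city serves as the capital of Australia?"

    ans_orig, conf_orig = answer_with_confidence(original)
    ans_pert, conf_pert = answer_with_confidence(perturbed)

    print(f"original : answer={ans_orig!r}  confidence={conf_orig}")
    print(f"perturbed: answer={ans_pert!r}  confidence={conf_pert}")
    if conf_orig is not None and conf_pert is not None:
        print(f"confidence shift: {conf_pert - conf_orig:+.1f} points")
    print(f"answer changed: {ans_orig.strip().lower() != ans_pert.strip().lower()}")
```

A robust model should report roughly the same answer and confidence for both phrasings; the paper's finding is that small edits of this kind can swing the reported confidence substantially.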

🔍 Key Points

  • The paper introduces a novel framework for assessing verbal confidence in large language models (LLMs) under adversarial attacks, utilizing both perturbation and jailbreak-based methods.
  • Experiments show that current methods for eliciting verbal confidence are highly vulnerable: the proposed attacks significantly reduce confidence scores and cause frequent answer changes (up to 100% of answers changed in some settings); see the metric sketch after this list.
  • The study reveals that existing defense techniques against adversarial attacks on verbal confidence are largely ineffective or even counterproductive, highlighting a critical gap in robustness.
  • The authors conduct extensive evaluations across various prompting strategies, model sizes, and application domains, showcasing the widespread vulnerabilities in verbal confidence estimation.
  • This research emphasizes the urgency for developing more robust mechanisms for confidence expression in LLMs, as even minor perturbations can severely distort confidence levels.
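
As a concrete illustration of the measurements referenced above, here is a minimal sketch of how attack impact could be aggregated over a benchmark as a mean confidence drop and an answer-change rate. The record layout is an assumed schema for clean/attacked prompt pairs, not the paper's evaluation harness, and the demo values are toy numbers.

```python
# Minimal sketch of aggregating attack impact over a benchmark: mean
# verbal-confidence drop and answer-change rate between clean and attacked
# prompts. The Record layout is an assumed schema, not the paper's data format.
from dataclasses import dataclass


@dataclass
class Record:
    clean_answer: str
    clean_confidence: float      # 0-100 verbal confidence on the clean prompt
    attacked_answer: str
    attacked_confidence: float   # 0-100 verbal confidence on the attacked prompt


def attack_impact(records: list[Record]) -> dict[str, float]:
    """Summarize how much an attack lowers confidence and flips answers."""
    n = len(records)
    mean_drop = sum(r.clean_confidence - r.attacked_confidence for r in records) / n
    flip_rate = sum(
        r.clean_answer.strip().lower() != r.attacked_answer.strip().lower()
        for r in records
    ) / n
    return {"mean_confidence_drop": mean_drop, "answer_change_rate": flip_rate}


if __name__ == "__main__":
    # Toy records for illustration only; not results from the paper.
    demo = [
        Record("Canberra", 95.0, "Canberra", 40.0),
        Record("Canberra", 90.0, "Sydney", 55.0),
    ]
    print(attack_impact(demo))  # {'mean_confidence_drop': 45.0, 'answer_change_rate': 0.5}
```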

💡 Why This Paper Matters

This paper is crucial as it addresses a significant gap in the understanding of verbal confidence in LLMs, particularly under adversarial conditions. The findings stress the necessity for improved methods to enhance the robustness of confidence outputs, which is critical for trustworthy human-AI interaction in high-stakes applications. By demonstrating the vulnerabilities and inefficacies in current defense mechanisms, this work sets a clear agenda for future research and development in the field of AI safety and reliability.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is of paramount interest because it explores the intersection of adversarial machine learning and natural language processing, identifying potential risks that can be exploited in real-world applications. The insights into the vulnerabilities of verbal confidence in LLMs offer valuable implications for designing more secure AI systems. As LLMs are increasingly integrated into sensitive domains, understanding these risks can lead to better protective measures and improved calibration techniques to ensure safer AI deployments.

📚 Read the Full Paper: https://arxiv.org/abs/2507.06489v1