
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification

Authors: Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing

Published: 2026-02-05

arXiv ID: 2602.05535v1

Added to Library: 2026-02-06 03:02 UTC

Red Teaming

📄 Abstract

Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as *misbehaviors* of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.

🔍 Key Points

  • Proposes a novel method called Evidential Uncertainty Quantification (EUQ) to quantify two types of epistemic uncertainty (conflict and ignorance) in large vision-language models (LVLMs).
  • Demonstrates that EUQ can effectively differentiate between various misbehaviors (hallucinations, jailbreaks, adversarial vulnerabilities, and OOD failures) using state-of-the-art LVLMs, consistently outperforming existing baselines in detection performance.
  • Introduces a layer-wise analysis of evidential uncertainty to interpret the dynamics of internal representations in LVLMs, revealing that ignorance decreases and conflict increases across deeper layers of the model.
  • Shows empirical evidence that high internal conflict corresponds to hallucinations while high ignorance relates to OOD failures, offering insight into how LVLMs behave when their outputs diverge from human expectations.
  • Releases source code for EUQ, ensuring reproducibility and providing a tool for further research in the field.
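The conflict/ignorance decomposition described above can be illustrated with a toy subjective-logic-style sketch. This is a simplified binomial-opinion model with hypothetical function names, not the authors' actual EUQ implementation: per class, positive evidence `r` and negative evidence `s` are turned into belief, disbelief, and an ignorance mass, and a separate conflict score captures evidence pulling in both directions at once.

```python
def evidential_uncertainty(pos_evidence, neg_evidence, prior_weight=2.0):
    """Toy binomial subjective-logic opinion per class.

    For each class, belief + disbelief + ignorance = 1. `prior_weight`
    plays the role of the non-informative prior mass. (Illustrative
    formulation only; the paper's EUQ may differ.)
    """
    opinions = []
    for r, s in zip(pos_evidence, neg_evidence):
        total = r + s + prior_weight
        belief = r / total          # mass supporting the class
        disbelief = s / total       # mass opposing the class
        ignorance = prior_weight / total  # mass from missing evidence
        # Conflict: evidence simultaneously supporting and opposing
        # the same class (peaks when belief == disbelief).
        conflict = 2.0 * belief * disbelief
        opinions.append({"belief": belief, "disbelief": disbelief,
                         "ignorance": ignorance, "conflict": conflict})
    return opinions


def misbehavior_scores(opinions):
    """Aggregate per-class opinions into two detection scores:
    mean ignorance (knowledge gaps) and mean conflict (contradiction)."""
    n = len(opinions)
    return {
        "ignorance": sum(o["ignorance"] for o in opinions) / n,
        "conflict": sum(o["conflict"] for o in opinions) / n,
    }
```

Under this toy model, a class with strong evidence on both sides (e.g. `r = s = 10`) yields high conflict and low ignorance, while a class with no evidence at all yields ignorance near 1 and zero conflict, mirroring the paper's finding that hallucinations align with conflict and OOD failures with ignorance.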

💡 Why This Paper Matters

This paper makes significant strides in understanding and quantifying the uncertainties in large vision-language models, providing a robust framework for improving the reliability and safety of AI systems that operate on both visual and textual data. The approach not only improves the detection of misbehaviors in LVLMs but also opens new avenues for interpreting model behavior, making it valuable for advancing AI deployment in critical areas.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodology presented in this paper are of particular interest to AI security researchers because they tackle the critical issue of ensuring the safety and reliability of AI models. Understanding the sources of misalignment between model outputs and human expectations can lead to more secure models, capable of resisting adversarial inputs and operating reliably in diverse real-world scenarios. Such insights are vital for the broader adoption of AI in sensitive domains like healthcare and autonomous systems.
