From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

Authors: Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang

Published: 2025-05-30

arXiv ID: 2505.24232v1

Added to Library: 2025-06-02 03:01 UTC

Red Teaming

📄 Abstract

Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.
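
To make the framing concrete, below is a minimal text-only sketch (not the authors' code) that writes both vulnerabilities as optimization toward the same target output y*: a jailbreak-style loss over a soft adversarial suffix (a continuous relaxation of token-level attacks such as GCG) and a hallucination-style loss over per-token reweighting of the prompt (a crude stand-in for attention redistribution). GPT-2, the prompt, and the target string are placeholder choices; the paper's experiments use LLaVA-1.5 and MiniGPT-4.

```python
# Sketch only: GPT-2 stands in for an LFM; prompt and target are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():            # the model itself stays frozen
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight               # (vocab, hidden)

prompt_ids = tok("Describe the image:", return_tensors="pt").input_ids
target_ids = tok(" a red stop sign", return_tensors="pt").input_ids   # target y*
prompt_embeds = emb[prompt_ids[0]].unsqueeze(0)

def target_nll(prefix_embeds):
    """Negative log-likelihood of the target y* given a (soft) prefix."""
    full = torch.cat([prefix_embeds, emb[target_ids[0]].unsqueeze(0)], dim=1)
    logits = model(inputs_embeds=full).logits
    pred = logits[:, prefix_embeds.size(1) - 1 : -1, :]   # positions predicting y*
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

# Jailbreak as token-level optimization: a soft relaxation of searching for an
# adversarial suffix (discrete attacks such as GCG optimize token choices instead).
suffix = (0.02 * torch.randn(1, 5, emb.size(1))).requires_grad_(True)

# Hallucination as attention-level optimization: logits that redistribute how much
# each prompt token contributes (a crude proxy for attention reweighting).
attn_logits = torch.zeros(1, prompt_ids.size(1), 1, requires_grad=True)

opt = torch.optim.Adam([suffix, attn_logits], lr=0.05)
for step in range(50):
    opt.zero_grad()
    loss_jb = target_nll(torch.cat([prompt_embeds, suffix], dim=1))
    w = torch.softmax(attn_logits, dim=1) * prompt_ids.size(1)   # mean-one weights
    loss_hal = target_nll(prompt_embeds * w)
    (loss_jb + loss_hal).backward()
    opt.step()
    if step % 10 == 0:
        print(f"step {step:2d}  jailbreak-style {loss_jb.item():.3f}  "
              f"hallucination-style {loss_hal.item():.3f}")
```

Both losses are driven toward the same target-specific output, which is the shape of the "Similar Loss Convergence" claim; this toy run illustrates the setup, not the paper's evidence.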

🔍 Key Points

  • Identifies and formalizes the interplay between hallucinations and jailbreaks in large foundation models, proposing a unified optimization framework that interconnects these vulnerabilities.
  • Establishes two theoretical propositions: that the loss functions for hallucinations and jailbreaks converge similarly when optimized toward target-specific outputs, and that their gradients remain consistent under attention redistribution (a toy gradient-alignment check is sketched after this list).
  • Empirical validation of theoretical propositions on LLaVA-1.5 and MiniGPT-4 reveals that defenses for hallucinations can also mitigate jailbreaks and vice versa, demonstrating a shared failure mode in LFMs.
  • Provides insights into cross-domain mitigation strategies, suggesting practical approaches for enhancing the robustness of large foundation models against these vulnerabilities.
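
The gradient-consistency proposition can be probed in the same toy setting by taking gradients of both losses with respect to a shared quantity, here the prompt embeddings, and measuring their cosine similarity. The paper performs its analysis on attention dynamics in LLaVA-1.5 and MiniGPT-4, so the GPT-2 model, prompt, target, and random attention-like weights below are assumptions for illustration only.

```python
# Toy gradient-alignment check; not the paper's protocol or models.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
emb = model.get_input_embeddings().weight

prompt_ids = tok("Describe the image:", return_tensors="pt").input_ids
target_ids = tok(" a red stop sign", return_tensors="pt").input_ids

def target_nll(prefix_embeds):
    """Negative log-likelihood of the target continuation given a soft prefix."""
    full = torch.cat([prefix_embeds, emb[target_ids[0]].unsqueeze(0)], dim=1)
    logits = model(inputs_embeds=full).logits
    pred = logits[:, prefix_embeds.size(1) - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))

# Shared quantity both gradients are taken with respect to.
prompt_embeds = emb[prompt_ids[0]].unsqueeze(0).detach().requires_grad_(True)

# Jailbreak-style loss: prompt followed by a fixed soft adversarial suffix.
suffix = torch.zeros(1, 5, emb.size(1))
g_jb = torch.autograd.grad(
    target_nll(torch.cat([prompt_embeds, suffix], dim=1)), prompt_embeds)[0]

# Hallucination-style loss: prompt under a random attention-like reweighting.
w = torch.softmax(torch.randn(1, prompt_ids.size(1), 1), dim=1) * prompt_ids.size(1)
g_hal = torch.autograd.grad(target_nll(prompt_embeds * w), prompt_embeds)[0]

cos = F.cosine_similarity(g_jb.flatten(), g_hal.flatten(), dim=0)
print(f"cosine similarity between the two gradients: {cos.item():.3f}")
```

A strongly positive similarity would be consistent with the paper's claim that the two failure modes push the model's internal state in the same direction; a toy number here proves nothing on its own.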

💡 Why This Paper Matters

This paper changes how vulnerabilities in large foundation models are understood and addressed. By tying hallucinations and jailbreaks to a shared optimization and attention mechanism, it motivates robustness strategies that counter adversarial manipulation and misalignment jointly rather than in isolation, which is crucial for developing safer AI systems.

🎯 Why It's Interesting for AI Security Researchers

The findings within this paper are significant for AI security researchers because they provide a novel theoretical and empirical framework for understanding and mitigating vulnerabilities in large foundation models. The ability to simultaneously address multiple vulnerabilities through shared mitigation strategies presents promising avenues for improving the safety and reliability of AI systems in real-world applications.

📚 Read the Full Paper

https://arxiv.org/abs/2505.24232v1