
Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defense

Authors: Zejian Chen, Chaozhuo Li, Chao Li, Xi Zhang, Litian Zhang, Yiming He

Published: 2026-01-07

arXiv ID: 2601.03594v1

Added to Library: 2026-01-08 03:03 UTC

Red Teaming

📄 Abstract

This paper provides a systematic survey of jailbreak attacks and defenses on Large Language Models (LLMs) and Vision-Language Models (VLMs), emphasizing that jailbreak vulnerabilities stem from structural factors such as incomplete training data, linguistic ambiguity, and generative uncertainty. It further differentiates hallucinations from jailbreaks in terms of intent and triggering mechanisms. The survey is organized along a three-dimensional framework: (1) the attack dimension, including template/encoding-based attacks, in-context learning manipulation, reinforcement/adversarial learning, LLM-assisted and fine-tuned attacks, as well as prompt- and image-level perturbations and agent-based transfer in VLMs; (2) the defense dimension, encompassing prompt-level obfuscation, output evaluation, and model-level alignment or fine-tuning; and (3) the evaluation dimension, covering metrics such as Attack Success Rate (ASR), toxicity score, query/time cost, and the multimodal metrics Clean Accuracy and Attribute Success Rate. Compared with prior work, this survey spans the full spectrum from text-only to multimodal settings, consolidating shared mechanisms and proposing unified defense principles: variant-consistency and gradient-sensitivity detection at the perception layer, safety-aware decoding and output review at the generation layer, and adversarially augmented preference alignment at the parameter layer. It also summarizes existing multimodal safety benchmarks and discusses future directions, including automated red teaming, cross-modal collaborative defense, and standardized evaluation.
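
As a concrete illustration of the evaluation dimension, the sketch below shows how Attack Success Rate is typically computed alongside the query/time cost the abstract mentions: each jailbreak prompt is sent to the target model and a judge decides whether the response violates policy. This is a minimal sketch, not code from the paper; `query_model` and `judge_is_harmful` are hypothetical placeholders for the target model API and a safety judge (e.g. a classifier or LLM-as-judge).

```python
# Minimal sketch of Attack Success Rate (ASR) evaluation.
# `query_model` and `judge_is_harmful` are hypothetical placeholders,
# not APIs defined in the paper.
import time
from typing import Callable, List


def attack_success_rate(
    prompts: List[str],
    query_model: Callable[[str], str],
    judge_is_harmful: Callable[[str, str], bool],
) -> dict:
    """Return ASR together with simple query/time cost measurements."""
    successes = 0
    start = time.time()
    for prompt in prompts:
        response = query_model(prompt)          # one query to the target model
        if judge_is_harmful(prompt, response):  # judge flags a policy-violating reply
            successes += 1
    elapsed = time.time() - start
    return {
        "asr": successes / len(prompts) if prompts else 0.0,
        "num_queries": len(prompts),
        "wall_clock_seconds": elapsed,
    }
```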

🔍 Key Points

  • Provides a systematic survey of jailbreak attacks and defenses on LLMs and VLMs, organizing them within a three-dimensional framework of attack, defense, and evaluation.
  • Draws a clear distinction between hallucinations and jailbreaks based on intent and triggering mechanisms, sharpening understanding of the vulnerabilities that jailbreaks exploit.
  • Introduces unified defense principles across different model layers (perception, generation, and parameter), emphasizing the need for comprehensive safety measures; a minimal sketch of the perception-layer variant-consistency check appears after this list.
  • Summarizes the current landscape of multimodal safety benchmarks and suggests future research directions, including automated red teaming, cross-modal collaborative defense, and standardized evaluation protocols.
  • Highlights the critical importance of addressing jailbreak vulnerabilities for enhancing the safety, reliability, and ethical standards of AI systems.
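
The sketch below illustrates the variant-consistency idea at the perception layer under simple assumptions: a request that the model accepts only in its exact adversarial phrasing, while paraphrased or lightly perturbed variants are refused, is suspicious. The functions `make_variants` and `refuses` are hypothetical placeholders, not interfaces from the paper; a real implementation would paraphrase text or add small image perturbations and wrap the target model's refusal decision.

```python
# Minimal sketch of a perception-layer variant-consistency check,
# one of the unified defense principles the survey proposes.
# `make_variants` and `refuses` are hypothetical placeholders.
from typing import Callable, List


def variant_consistency_flag(
    user_input: str,
    make_variants: Callable[[str, int], List[str]],
    refuses: Callable[[str], bool],
    n_variants: int = 4,
    threshold: float = 0.5,
) -> bool:
    """Return True if the input should be blocked or escalated for review."""
    if refuses(user_input):
        return True  # already refused in its original form; nothing to bypass
    variants = make_variants(user_input, n_variants)
    refusal_rate = sum(refuses(v) for v in variants) / max(len(variants), 1)
    # A benign request is typically accepted across all variants; an adversarial
    # prompt often slips past the safety filter only in its exact original
    # phrasing, so high disagreement across variants is a jailbreak signal.
    return refusal_rate >= threshold
```

In a deployed pipeline this check would sit in front of generation; inputs it flags can be routed to the output-review and safety-aware decoding defenses the survey groups at the generation layer.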

💡 Why This Paper Matters

This paper provides a comprehensive overview and analysis of jailbreak attacks on LLMs and VLMs, detailing both existing vulnerabilities and potential defenses. Its unified defense framework and emphasis on systematic evaluation metrics offer concrete strategies for improving AI model robustness, making it a vital resource for advancing the field of AI security.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly pertinent to AI security researchers as it sheds light on the multifaceted challenges posed by jailbreak vulnerabilities in LLMs and VLMs. By systematically categorizing attack techniques and defense strategies, it equips researchers with crucial insights for developing effective mitigation mechanisms. Additionally, the exploration of ethical implications and future research avenues aligns with the growing emphasis on responsible AI technology, making it a valuable contribution to the ongoing discourse in this field.

📚 Read the Full Paper