← Back to Library

Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks

Authors: Xiaodong Wu, Xiangman Li, Jianbing Ni

Published: 2025-06-23

arXiv ID: 2506.18543v1

Added to Library: 2025-06-24 04:00 UTC

Red Teaming

📄 Abstract

The widespread deployment of large language models (LLMs) has raised critical concerns over their vulnerability to jailbreak attacks, i.e., adversarial prompts that bypass alignment mechanisms and elicit harmful or policy-violating outputs. While proprietary models like GPT-4 have undergone extensive evaluation, the robustness of emerging open-source alternatives such as DeepSeek remains largely underexplored, despite their growing adoption in real-world applications. In this paper, we present the first systematic jailbreak evaluation of DeepSeek-series models, comparing them with GPT-3.5 and GPT-4 using the HarmBench benchmark. We evaluate seven representative attack strategies across 510 harmful behaviors categorized by both function and semantic domain. Our analysis reveals that DeepSeek's Mixture-of-Experts (MoE) architecture introduces routing sparsity that offers selective robustness against optimization-based attacks such as TAP-T, but leads to significantly higher vulnerability under prompt-based and manually engineered attacks. In contrast, GPT-4 Turbo demonstrates stronger and more consistent safety alignment across diverse behaviors, likely due to its dense Transformer design and reinforcement learning from human feedback. Fine-grained behavioral analysis and case studies further show that DeepSeek often routes adversarial prompts to under-aligned expert modules, resulting in inconsistent refusal behaviors. These findings highlight a fundamental trade-off between architectural efficiency and alignment generalization, emphasizing the need for targeted safety tuning and modular alignment strategies to ensure secure deployment of open-source LLMs.

🔍 Key Points

  • This paper presents the first comprehensive systematic evaluation of the jailbreak resistance between DeepSeek-series models and GPT-series models using the HarmBench benchmark, providing a direct comparison of their robustness against various attack strategies.
  • The analysis reveals a fundamental trade-off between architectural efficiency and alignment robustness, highlighting that while DeepSeek's Mixture-of-Experts architecture can provide some resilience against specific automated attacks, it remains vulnerable to direct prompt-based attacks.
  • The paper categorizes 510 harmful behaviors into functional and semantic domains, showcasing how DeepSeek performs consistently worse than GPT-4 across most categories, particularly in high-risk content areas like misinformation and cybercrime.
  • Fine-grained behavioral analyses indicate that DeepSeek often routes adversarial prompts to under-aligned expert modules, while GPT-4's dense architecture results in stronger and more consistent safety performances across diverse attack types.
  • The study emphasizes the necessity for targeted safety tuning and modular alignment strategies for open-source LLMs such as DeepSeek to enhance their security in real-world applications.

💡 Why This Paper Matters

This paper is highly relevant as it addresses the critical security implications of deploying large language models in real-world applications, especially concerning their vulnerabilities to jailbreak attacks. By systematically exploring and comparing the robustness of DeepSeek and GPT models, the authors not only shed light on the strengths and weaknesses of these architectures but also provide valuable insights for researchers and developers working on enhancing AI safety measures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of significant interest due to its exploration of jailbreak attacks, a growing concern in the field of AI safety. The detailed comparison between different model architectures, the introduction of the HarmBench benchmark for evaluating attacks, and insights into the specific vulnerabilities of open-source models like DeepSeek provide a crucial foundation for future research aimed at strengthening the resilience of language models against adversarial prompts.

📚 Read the Full Paper