
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Authors: Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han

Published: 2025-08-08

arXiv ID: 2508.10029v1

Added to Library: 2025-08-15 04:02 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
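To make the core mechanism concrete, the sketch below illustrates hidden-state interpolation in the spirit of LFJ: a forward hook on one decoder layer blends the hidden states of a query the model would normally refuse with cached hidden states from a thematically similar benign query. This is a minimal illustration under stated assumptions, not the paper's pipeline: the model name, layer index, blending weight, placeholder queries, and the choice to blend only during the prompt pass are all assumptions made here for brevity, whereas LFJ selects influential layers and tokens via gradients and then optimizes the interpolation.

```python
# Minimal sketch of hidden-state interpolation between a benign query and a
# refused query, in the spirit of LFJ. Model name, layer index, blending
# weight, and placeholder queries are assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"   # assumed Hugging Face checkpoint
LAYER = 16                            # assumed "influential" decoder layer
ALPHA = 0.5                           # assumed interpolation weight

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

harmful_query = "<query the model would normally refuse>"   # placeholder
benign_query = "<thematically similar benign query>"        # placeholder

# Cache the benign query's hidden states at the chosen layer.
with torch.no_grad():
    benign_ids = tok(benign_query, return_tensors="pt").input_ids
    benign_hidden = model(benign_ids, output_hidden_states=True).hidden_states[LAYER]

def blend_hook(module, inputs, output):
    """Blend this layer's output with the cached benign representation."""
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:  # only blend the prompt (prefill) pass
        n = min(hidden.shape[1], benign_hidden.shape[1])
        hidden[:, :n, :] = (1 - ALPHA) * hidden[:, :n, :] + ALPHA * benign_hidden[:, :n, :]
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(blend_hook)
try:
    with torch.no_grad():
        harmful_ids = tok(harmful_query, return_tensors="pt").input_ids
        out_ids = model.generate(harmful_ids, max_new_tokens=64)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The paper additionally optimizes the interpolation with gradient guidance to balance attack success, output fluency, and computational cost; that optimization loop is omitted here.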

🔍 Key Points

  • Introduction of Latent Fusion Jailbreak (LFJ), a representation-based attack that successfully elicits harmful outputs from large language models (LLMs) through hidden state interpolation.
  • Achieved an average attack success rate (ASR) of 94.01% across multiple models and benchmarks, significantly surpassing existing jailbreak techniques.
  • Proposed an effective adversarial training defense that fine-tunes LLMs on adversarial examples derived from LFJ, reducing ASR by over 80% while maintaining performance on benign inputs (a minimal training sketch follows this list).
  • Conducted comprehensive ablation studies that provide insight into the crucial components of the LFJ methodology, such as query pair selection and gradient-guided optimization.
  • Demonstrated the practical implications of LFJ for safety alignment in LLMs, raising concerns about the robustness of current safety mechanisms and compliance systems.
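
As a rough illustration of the defense direction named above, the sketch below fine-tunes a model so that interpolation-derived adversarial prompts map to a refusal, with the loss masked so that only the refusal tokens are supervised. The model name, refusal template, learning rate, schedule, and the stub list of adversarial prompts are assumptions for illustration, not the paper's training recipe.

```python
# Minimal sketch of an adversarial-training defense: fine-tune the model so
# that interpolation-derived adversarial prompts are answered with a refusal.
# Hyperparameters, refusal text, and the prompt list are illustrative stubs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

REFUSAL = "I'm sorry, but I can't help with that."

# Assume these prompts were produced offline by the interpolation attack
# (e.g. with a procedure like the sketch above); here it is a stub list.
adversarial_prompts = ["<interpolation-based adversarial prompt>"]

def make_example(prompt: str):
    """Build input_ids/labels so the loss covers only the refusal tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(REFUSAL + tok.eos_token, add_special_tokens=False,
                     return_tensors="pt").input_ids[0]
    input_ids = torch.cat([prompt_ids, target_ids])
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids])
    return input_ids, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed LR
for epoch in range(1):                                       # assumed schedule
    for prompt in adversarial_prompts:
        input_ids, labels = make_example(prompt)
        out = model(input_ids.unsqueeze(0), labels=labels.unsqueeze(0))
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice such fine-tuning would be mixed with benign instruction data, which is how the paper reports preserving performance on benign inputs while reducing ASR.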

💡 Why This Paper Matters

This paper is relevant and important because it exposes vulnerabilities in LLMs concerning safety and alignment, providing both a practical attack methodology (LFJ) and a robust defense strategy. By quantifying the effectiveness of both the attack and the defense through experimental evaluation, it serves as a wake-up call to researchers and practitioners about the potential for malicious exploitation of language models and the necessity for enhanced security measures.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of great interest to AI security researchers as it not only uncovers a novel method for bypassing the safety measures in LLMs but also proposes actionable defenses. The effectiveness of LFJ in real-world scenarios highlights the urgent need for improved safety protocols and a deeper understanding of LLM vulnerabilities, thus presenting rich avenues for future research in AI safety and security.

📚 Read the Full Paper: https://arxiv.org/abs/2508.10029v1