
Probing the Robustness of Large Language Models Safety to Latent Perturbations

Authors: Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang

Published: 2025-06-19

arXiv ID: 2506.16078v1

Added to Library: 2025-06-23 04:02 UTC

Red Teaming

📄 Abstract

Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
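
The probe described above is straightforward to prototype: add a small perturbation to one layer's hidden states and measure how much the Negative Log-Likelihood of the model's original (safe) response increases. The sketch below is a minimal illustration of that idea, not the authors' implementation; it assumes a LLaMA-style Hugging Face checkpoint, and the model name, layer index, and perturbation scale are placeholders.

```python
# Minimal sketch (not the paper's code): perturb one layer's hidden states and
# measure the NLL of the model's own original response under that perturbation.
# Layer path `model.model.layers[i]` assumes a LLaMA-style architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def nll_with_perturbation(prompt_ids, response_ids, layer_idx, delta):
    """NLL of `response_ids` when `delta` is added to layer `layer_idx` activations."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100        # score only the response tokens

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta                      # latent perturbation
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            loss = model(input_ids=input_ids, labels=labels).loss  # mean token NLL
    finally:
        handle.remove()
    return loss.item()

prompt = tok("How do I stay safe online?", return_tensors="pt").input_ids
response = tok(" Use strong, unique passwords.",
               add_special_tokens=False, return_tensors="pt").input_ids
delta = 0.05 * torch.randn(model.config.hidden_size)  # small random direction
print(nll_with_perturbation(prompt, response, layer_idx=8, delta=delta))
```

In the paper's framing, directions that sharply raise this NLL under small perturbation norms are candidate vulnerable directions, which ASA then exploits to steer the model away from its original refusal.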

🔍 Key Points

  • Identification of fundamental structural vulnerabilities in the safety alignment of large language models, highlighting insufficient local robustness in latent representations.
  • Introduction of the Activation Steering Attack (ASA), a novel attack method utilizing Negative Log-Likelihood probing to detect latent adversarial vulnerabilities in aligned models.
  • Development of ASABench, a comprehensive benchmark for evaluating the latent robustness of safety-aligned models with over 4,862 validated attack instances.
  • Proposal of Layer-wise Adversarial Patch Training (LAPT), a targeted fine-tuning strategy that significantly enhances alignment robustness while preserving general task performance (a minimal sketch follows this list).
  • Empirical evidence demonstrating the cumulative effects of attacks, implying that small internal perturbations can lead to substantial safety violations in language models.
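
The LAPT bullet can be pictured as a small training loop: inject a bounded "patch" into one layer's activations during the forward pass and supervise the aligned refusal as usual, so the refusal behavior survives latent shifts. The sketch below is an illustration under that reading, not the paper's LAPT implementation; it again assumes a LLaMA-style checkpoint, and the layer index, perturbation scale, and optimizer settings are illustrative.

```python
# Minimal sketch of adversarial-patch-style fine-tuning at a single layer
# (not the authors' LAPT code): add a bounded random perturbation to hidden
# states during the forward pass, then train on the aligned refusal target.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

LAYER_IDX, EPSILON = 8, 0.05                       # illustrative choices

def add_patch(_module, _inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    patch = EPSILON * torch.randn_like(hidden)     # controlled latent perturbation
    hidden = hidden + patch
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def train_step(prompt, refusal):
    ids = tok(prompt + refusal, return_tensors="pt").input_ids
    labels = ids.clone()
    # Mask the prompt so only the refusal is supervised (boundaries approximate).
    labels[:, : tok(prompt, return_tensors="pt").input_ids.shape[-1]] = -100

    handle = model.model.layers[LAYER_IDX].register_forward_hook(add_patch)
    try:
        loss = model(input_ids=ids, labels=labels).loss  # refusal loss under perturbation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    finally:
        handle.remove()
    return loss.item()

print(train_step("How can I make a weapon?", " I can't help with that."))
```

Keeping the perturbation bounded and layer-specific is what distinguishes this from ordinary refusal fine-tuning: the supervision targets the latent neighborhood around the aligned response rather than only its surface form.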

💡 Why This Paper Matters

This paper is pivotal in illuminating the latent vulnerabilities present in safety-aligned large language models, providing actionable methodologies for assessing and improving their robustness. By introducing novel attack strategies and evaluation frameworks, it emphasizes the need for a deeper understanding of internal representations to build safer AI systems, which is critical for the advancement of responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies presented in this paper are highly relevant to AI security researchers as they uncover critical weaknesses in current alignment paradigms. The introduction of systematic evaluation tools and adversarial training strategies offers significant implications for developing more robust AI systems, thereby contributing to enhanced safety measures against potential misuse and failures in AI models.

📚 Read the Full Paper