Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

Authors: Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha

Published: 2025-07-24

arXiv ID: 2507.18631v1

Added to Library: 2025-07-25 04:00 UTC

Safety

📄 Abstract

With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.
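The abstract describes the idea at a high level; the sketch below illustrates what representation-based filtering at a single layer can look like in practice. It is not the authors' implementation (see the linked repository for that): the model name, the layer index `LAYER`, the use of mean-pooled hidden states, and the harmful/harmless reference prompts are all assumptions made for illustration.

```python
# Minimal sketch (not the LARF implementation): score fine-tuning samples by how
# strongly their hidden states at a chosen "safety-sensitive" layer align with a
# direction that separates harmful from harmless reference prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any aligned chat model
LAYER = 14                                   # assumption: index of a safety-sensitive layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


@torch.no_grad()
def layer_rep(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at the chosen layer."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]  # (1, seq, dim)
    return hidden.mean(dim=1).squeeze(0).float()


def safety_direction(harmful: list[str], harmless: list[str]) -> torch.Tensor:
    """Unit vector from the mean harmless representation toward the mean harmful one."""
    h = torch.stack([layer_rep(t) for t in harmful]).mean(dim=0)
    b = torch.stack([layer_rep(t) for t in harmless]).mean(dim=0)
    d = h - b
    return d / d.norm()


def score_sample(text: str, direction: torch.Tensor) -> float:
    """Higher score -> the sample's representation leans toward the harmful cluster."""
    return torch.dot(layer_rep(text), direction).item()
```

How the safety-sensitive layer is actually selected and how samples are scored are defined by LARF itself; this sketch only shows why layer-wise representations can expose safety-degrading samples that look benign on the surface.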

🔍 Key Points

  • Introduction of LARF (Layer-Aware Representation Filtering) as a novel method to filter fine-tuning data that could degrade the safety alignment of large language models (LLMs).
  • LARF identifies safety-sensitive layers in LLMs and uses their representations to detect and remove safety-degrading samples from otherwise benign datasets, significantly improving safety performance.
  • Experimental results demonstrate LARF's superiority in identifying safety-degrading data compared to existing methods, leading to a substantial reduction in Attack Success Rates (ASR) across several benchmarks and model configurations.
  • The findings reveal that benign datasets often contain safety threats that traditional filtering approaches fail to recognize, underlining the necessity of layer-aware representation techniques.
  • LARF has practical applications as a pre-deployment audit tool, allowing developers to ensure model safety before implementing LLMs in sensitive environments (a rough filtering sketch follows this list).
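
As a rough illustration of the audit use case in the last point, the snippet below drops the highest-scoring fraction of a fine-tuning dataset before training. It is hypothetical and not taken from the LARF codebase: the per-sample scores are assumed to come from a representation-based filter such as the sketch above, and the `drop_frac` threshold is an arbitrary choice.

```python
# Hypothetical pre-finetuning audit step: remove the samples whose representations
# look most safety-degrading according to a precomputed score.
def filter_dataset(samples: list[str], scores: list[float], drop_frac: float = 0.05) -> list[str]:
    """Return `samples` with the top `drop_frac` fraction by score removed."""
    k = int(len(samples) * drop_frac)
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:k])
    return [s for i, s in enumerate(samples) if i not in flagged]
```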

💡 Why This Paper Matters

This paper is significant due to its introduction of a novel framework that directly addresses safety alignment degradation in LLMs, a critical issue as these models become increasingly integrated into real-world applications. By effectively identifying and filtering out safety-threatening data, LARF enhances the robustness of LLMs against malicious usage, thus supporting safe deployment in sensitive domains.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it tackles the pressing issue of safety vulnerabilities in AI systems. The method presented provides a technical solution to mitigate risks associated with harmful content generation in LLMs, enabling researchers to explore mechanisms that ensure safe and reliable AI behavior. In an era where the security implications of AI are profound, such research contributes to building trust in AI systems, making it a critical read for those focused on securing AI technologies.

📚 Read the Full Paper