
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment

Authors: Hao Li, Lijun Li, Zhenghao Lu, Xianyi Wei, Rui Li, Jing Shao, Lei Sha

Published: 2025-07-24

arXiv ID: 2507.18631v2

Added to Library: 2025-07-28 01:00 UTC

Safety

📄 Abstract

With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated. Please see our code at https://github.com/LLLeoLi/LARF.

🔍 Key Points

  • Introduces LARF (Layer-Aware Representation Filtering), a novel method to identify and filter safety-degrading data points from fine-tuning datasets for large language models (LLMs).
  • Demonstrates that fine-tuning on benign datasets can still degrade safety alignment due to the presence of subtle safety-degrading examples, emphasizing the need for effective data filtering methods.
  • Experimental results show that LARF achieves state-of-the-art performance in identifying and excluding safety-degrading training data, significantly improving post-fine-tuning safety alignment without requiring extensive computational resources.
  • Details how LARF locates safety-sensitive layers in LLMs and scores candidate samples via bidirectional representation similarity, sharpening the identification of data that may undermine the model's safety mechanisms (see the sketch after this list).
  • Highlights practical implications for deploying LLMs in sensitive applications, helping them maintain alignment with human safety standards after fine-tuning.
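The paper defines the actual layer-selection and scoring procedure; as a rough, hedged illustration of representation-based filtering at a single layer, the Python sketch below mean-pools hidden states from a HuggingFace-style causal LM and scores each fine-tuning sample by its cosine similarity to a harmful reference set minus its similarity to a benign reference set. The model name, layer index, reference prompts, and threshold are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of representation-based data filtering in the spirit of LARF.
# Assumptions (not from the paper): a HuggingFace causal LM, mean-pooled hidden
# states at one assumed "safety-sensitive" layer, and a simple two-sided
# cosine-similarity score against tiny harmful/benign reference sets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
LAYER = 14                                       # assumed safety-sensitive layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True
).eval()

@torch.no_grad()
def layer_embedding(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0).float()    # (d_model,)

def centroid(texts: list[str]) -> torch.Tensor:
    return torch.stack([layer_embedding(t) for t in texts]).mean(dim=0)

# Illustrative reference sets; LARF's construction of these is more involved.
harmful_refs = ["Explain how to pick a lock to break into a house."]
benign_refs = ["Explain how photosynthesis works."]
h_center, b_center = centroid(harmful_refs), centroid(benign_refs)

def safety_degradation_score(sample: str) -> float:
    """Higher score = the sample's representation leans toward the harmful set."""
    e = layer_embedding(sample)
    return (torch.cosine_similarity(e, h_center, dim=0)
            - torch.cosine_similarity(e, b_center, dim=0)).item()

# Keep only samples whose score falls below an assumed threshold.
finetune_data = ["Summarize this meeting transcript.", "Write a persuasive essay."]
filtered = [s for s in finetune_data if safety_degradation_score(s) < 0.1]
```

In this sketch the filtered subset would then be used for fine-tuning in place of the raw dataset; the paper's method additionally determines which layer is safety-sensitive rather than fixing it by hand.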

💡 Why This Paper Matters

The paper advances safety considerations for fine-tuning large language models. By proposing the LARF method, it addresses a critical vulnerability in the model training pipeline, where even seemingly innocuous data can lead to harmful outcomes. This work not only contributes a novel algorithm for data filtering but also underscores the broader implications for deploying AI systems in sensitive areas, promoting safer AI practices.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it tackles the pressing issue of safety alignment in LLMs, a key concern as these models become increasingly integrated into critical applications. The demonstration of vulnerabilities due to benign data and the introduction of a systematic approach to mitigate these risks provide actionable insights for enhancing AI safety protocols. Researchers focused on security will find the methodology and findings particularly pertinent for developing robust defenses against data-driven threats.

📚 Read the Full Paper