
EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Authors: Jialin Wu, Kecen Li, Zhicong Huang, Xinfeng Li, Xiaofeng Wang, Cheng Hong

Published: 2025-11-13

arXiv ID: 2511.09880v1

Added to Library: 2025-11-14 23:00 UTC

📄 Abstract

Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
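The abstract names the two core mechanisms, NTK-based safety vector distillation and interference-aware merging, without spelling out their exact form. The sketch below illustrates only the general safety-vector-merging idea under simplifying assumptions: the safety vector is approximated as a plain parameter difference between an aligned and a non-aligned base checkpoint, and "interference awareness" is stood in for by a simple sign-conflict damping heuristic. All names (`extract_safety_vector`, `interference_aware_merge`, `alpha`, `damp`) are hypothetical; this is not the paper's actual NTK-based method.

```python
import torch

def extract_safety_vector(aligned_base, unaligned_base):
    """Approximate the safety vector as the parameter delta between a
    safety-aligned base checkpoint and its non-aligned counterpart
    (task-arithmetic-style simplification)."""
    return {name: aligned_base[name] - unaligned_base[name] for name in aligned_base}

def interference_aware_merge(finetuned, base, safety_vec, alpha=0.5, damp=0.1):
    """Add the safety vector to a fine-tuned model's weights, shrinking it on
    coordinates where it points against the task-specific update -- a
    sign-conflict damping heuristic standing in for interference awareness."""
    merged = {}
    for name, w_ft in finetuned.items():
        task_delta = w_ft - base[name]   # what task fine-tuning changed
        s = safety_vec[name]             # what safety alignment changed
        conflict = torch.sign(s) * torch.sign(task_delta) < 0
        scale = torch.where(conflict,
                            torch.full_like(s, alpha * damp),
                            torch.full_like(s, alpha))
        merged[name] = w_ft + scale * s
    return merged

# Usage sketch (state_dicts with matching keys and shapes):
# safety_vec = extract_safety_vector(aligned.state_dict(), unaligned.state_dict())
# merged = interference_aware_merge(coder.state_dict(), base.state_dict(), safety_vec)
```

The sign-conflict damping echoes common model-merging heuristics (e.g., TIES-style sign resolution); the paper's NTK-based distillation presumably yields a better-conditioned safety vector than the raw parameter delta used here.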

🔍 Key Points

  • Identifies the systematic degradation of safety alignment caused by fine-tuning LLMs for specialized domains (e.g., code generation, biomedical analysis, mathematical problem solving) and frames safety transfer, rather than retraining, as the remedy.
  • Introduces EnchTable, a framework that transfers and maintains safety alignment in downstream fine-tuned LLMs without requiring extensive retraining.
  • Uses a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, making the extracted safety vector compatible across model architectures and sizes.
  • Applies an interference-aware merging technique that balances safety and utility, minimizing performance loss when the safety vector is injected into task-specialized models.
  • Evaluates a prototype on three task domains, three LLM architectures, and eleven datasets spanning multiple vendors; EnchTable resists static and dynamic jailbreaking attacks, outperforms vendor-released safety models, and achieves a lower unsafe rate and higher utility than six parameter modification methods and two inference-time alignment baselines, while integrating into deployment pipelines with little overhead.

💡 Why This Paper Matters

Fine-tuning LLMs for specialized domains routinely erodes their safety alignment, and re-aligning every downstream model through retraining is costly. EnchTable shows that safety can instead be distilled into a transferable vector and merged back into fine-tuned models with minimal utility loss, offering a practical, architecture-agnostic way to keep specialized models safe across vendors and deployment pipelines.

🎯 Why It's Interesting for AI Security Researchers

The paper is directly relevant to AI security researchers because it treats safety alignment as a portable component that can be transferred into fine-tuned models, rather than a property that must be re-established with every fine-tune. Its evaluations against static and dynamic jailbreaking attacks, together with comparisons against six parameter modification methods and two inference-time alignment baselines, report a lower unsafe rate at comparable or better utility, providing a concrete reference point for defenses that must preserve task performance while reducing harmful outputs.

📚 Read the Full Paper