SafeSeek: Universal Attribution of Safety Circuits in Language Models

Authors: Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing Fan, Kun Wang, Yufei Guo, Qingsong Wen

Published: 2026-03-24

arXiv ID: 2603.23268v1

Added to Library: 2026-03-25 03:01 UTC

Red Teaming

📄 Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to exploit these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: **(1) backdoor attacks**, identifying a backdoor circuit with 0.42% sparsity whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; **(2) safety alignment**, localizing an alignment circuit spanning 3.03% of heads and 0.79% of neurons whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
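
The abstract describes the attribution objective only at a high level. As a rough illustration of the general idea, here is a minimal PyTorch sketch of learning a relaxed (sigmoid) binary mask over attention heads with an L1 sparsity penalty. The model shape, relaxation, and hyperparameters are all assumptions, and `safety_loss` is a placeholder for evaluating the masked model on a safety dataset; the paper's actual objective and granularity are not specified in this summary.

```python
# Minimal sketch of differentiable binary masking over attention heads
# (assumed mechanics; not the paper's stated implementation).
import torch

n_layers, n_heads_per_layer = 32, 32  # assumed model shape
logits = torch.zeros(n_layers * n_heads_per_layer, requires_grad=True)
opt = torch.optim.Adam([logits], lr=1e-2)

def safety_loss(mask: torch.Tensor) -> torch.Tensor:
    """Stand-in for the loss of the masked model on a safety dataset.
    In practice, each attention head's output is scaled by its mask
    entry and the behavior being attributed (e.g., refusal, backdoor
    activation) is scored under that intervention."""
    return mask.new_zeros(())  # placeholder; wire to a real model here

lam = 1e-3  # sparsity weight (assumed hyperparameter)
for _ in range(1_000):
    mask = torch.sigmoid(logits)                 # relaxed binary mask in (0, 1)
    loss = safety_loss(mask) + lam * mask.sum()  # behavior loss + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Binarize: heads the optimization kept are the candidate circuit.
circuit_heads = torch.sigmoid(logits) > 0.5
```

A straight-through estimator or hard-concrete gate is a common alternative to the plain sigmoid relaxation shown here; the sparsity penalty drives the mask toward zero everywhere the behavior loss does not resist.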

🔍 Key Points

  • Introduction of SafeSeek, a unified framework for identifying safety circuits in LLMs using gradient-based optimization, which is more efficient than previous heuristic methods.
  • Demonstration of the framework's ability to locate sparse safety circuits linked to backdoor attacks and alignment, with detailed experimental results showing significant improvements in safety metrics without sacrificing general utility.
  • Validation of SafeSeek through extensive experiments on two safety scenarios: backdoor attacks and safety alignment, indicating that safety-critical behaviors can be isolated and modified effectively with minimal performance impact.
  • Introduction of Safety Circuit Tuning (SaCirT) as a method for fine-tuning safety circuits independently, preserving or enhancing safety while mitigating the 'alignment tax' during model optimization (see the sketch after this list).
  • Analysis demonstrating the sparsity and orthogonality of identified backdoor and safety circuits in LLM architectures, suggesting that backdoor triggers are decoupled from normal functionality.
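
The summary names SaCirT but does not spell out its mechanics: the circuit can be tuned in isolation for efficient safety fine-tuning, or excluded during helpfulness fine-tuning to avoid eroding alignment. Below is a minimal PyTorch sketch of the exclusion variant under assumed mechanics; `circuit_param_names` and the gradient-hook approach are illustrative, not the paper's stated procedure.

```python
# Sketch of circuit-aware fine-tuning (assumed mechanics): zero the
# fine-tuning gradients of parameters belonging to the identified
# safety circuit so helpfulness updates leave it untouched.
import torch
import torch.nn as nn

def freeze_circuit(model: nn.Module, circuit_param_names) -> None:
    """Register backward hooks that zero gradients for circuit parameters.

    `circuit_param_names` is a hypothetical set of parameter names mapped
    from the learned head/neuron mask; deriving it is model-specific.
    """
    for name, param in model.named_parameters():
        if name in circuit_param_names:
            param.register_hook(lambda grad: torch.zeros_like(grad))
```

The complementary variant, updating only the circuit parameters and freezing everything else, would support the sparse, efficient safety fine-tuning the abstract mentions.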

💡 Why This Paper Matters

The research presented in this paper is essential for advancing the interpretability and safety of large language models. By introducing SafeSeek, the authors provide a reliable framework for analyzing safety circuits within these models, enabling the identification and modification of malicious behaviors without compromising overall model utility. The study's results not only enhance our understanding of model safety but also propose practical solutions for mitigating threats associated with adversarial attacks in AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of high relevance to AI security researchers as it tackles critical issues regarding the safety and robustness of large language models. The methodologies developed for isolating and tuning safety circuits will be invaluable for understanding and improving model defenses against adversarial attacks and for ensuring that LLMs align more closely with human values. Furthermore, by openly addressing the vulnerabilities of these models, researchers can build more resilient AI systems that better protect against potential misuse.

📚 Read the Full Paper