
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Authors: Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

Published: 2026-04-02

arXiv ID: 2604.01604v1

Added to Library: 2026-04-03 02:01 UTC

Red Teaming

📄 Abstract

As safety concerns around large language models (LLMs) grow, understanding the internal mechanisms underlying refusal behavior has become increasingly important. Recent work has studied this behavior by identifying internal features associated with refusal and manipulating them to induce compliance with harmful requests. However, existing refusal feature selection methods rely on how strongly features activate on harmful prompts, which tends to capture superficial signals rather than the causal factors underlying the refusal decision. We propose CRaFT, a circuit-guided refusal feature selection framework that ranks features by their influence on the model's refusal-compliance decision using prompts near the refusal boundary. On Gemma-3-1B-it, CRaFT improves attack success rate (ASR) from 6.7% to 48.2% and outperforms baseline methods across multiple jailbreak benchmarks. These results suggest that circuit influence is a more reliable criterion than activation magnitude for identifying features that causally mediate refusal behavior.

🔍 Key Points

  • Introduces CRaFT, a circuit-guided refusal feature selection method for large language models (LLMs) that ranks features by their causal influence on the refusal decision rather than by how strongly they activate on harmful prompts.
  • Uses boundary-critical sampling (prompts near the refusal-compliance boundary) to isolate the features that actually drive the refusal-compliance decision, giving a more controlled, mechanistic view of internal model behavior.
  • Raises attack success rate (ASR) on Gemma-3-1B-it from 6.7% to 48.2%, outperforming baseline feature selection techniques across multiple jailbreak benchmarks.
  • Shows that circuit influence is a more dependable criterion than traditional activation-magnitude methods for identifying effective refusal features.
  • Includes an analysis of generated outputs, finding that responses elicited via CRaFT-selected features are markedly more specific and convincing.
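To make the selection criterion in the points above concrete, here is a minimal toy sketch, not the paper's implementation: all shapes, the random "transcoder activations", and the linear refusal readout are hypothetical stand-ins. It contrasts the baseline criterion (mean activation magnitude on harmful prompts) with an influence criterion (how much a refusal score moves when each feature is ablated on boundary prompts).

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_features = 16, 8

# Hypothetical feature activations on prompts near the refusal boundary.
acts = rng.random((n_prompts, n_features))

# Hypothetical linear readout from features to a per-prompt refusal logit.
readout = rng.normal(size=n_features)

def refusal_score(a):
    # Per-prompt refusal logit under the toy linear model.
    return a @ readout

baseline = refusal_score(acts)

# Baseline criterion: rank features by mean activation magnitude.
act_rank = np.argsort(-acts.mean(axis=0))

# Influence criterion: rank features by how much ablating (zeroing)
# each one shifts the refusal score on boundary prompts.
influence = np.empty(n_features)
for f in range(n_features):
    ablated = acts.copy()
    ablated[:, f] = 0.0
    influence[f] = np.abs(baseline - refusal_score(ablated)).mean()

infl_rank = np.argsort(-influence)
print("activation ranking:", act_rank)
print("influence ranking: ", infl_rank)
```

In this toy setup the two rankings can diverge: a feature may activate strongly yet carry little weight in the refusal readout, which is exactly the failure mode of activation-based selection that circuit influence is meant to avoid.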

💡 Why This Paper Matters

This research advances the understanding of refusal mechanisms within large language models, a topic of growing relevance given concerns about the safety and reliability of AI systems. By introducing the CRaFT framework, the authors provide a principled way to identify and manipulate the internal features that causally mediate refusal, which both demonstrates a concrete vulnerability and points toward models that handle potentially harmful prompts more responsibly.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it shows how the refusal behavior of large language models can be located, and bypassed, at the feature level. The insights gained from the CRaFT framework can inform the design and evaluation of safety measures in AI models, reducing the risk of misuse and hardening responses in sensitive applications. It also underscores the importance of understanding internal model mechanisms, which is crucial for building models that align with human values and ethical guidelines.
