SoK: A Comprehensive Causality Analysis Framework for Large Language Model Security

Authors: Wei Zhao, Zhe Li, Jun Sun

Published: 2025-12-04

arXiv ID: 2512.04841v1

Added to Library: 2025-12-05 03:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) exhibit remarkable capabilities but remain vulnerable to adversarial manipulations such as jailbreaking, where crafted prompts bypass safety mechanisms. Understanding the causal factors behind such vulnerabilities is essential for building reliable defenses. In this work, we introduce a unified causality analysis framework that systematically supports all levels of causal investigation in LLMs, ranging from token-level, neuron-level, and layer-level interventions to representation-level analysis. The framework enables consistent experimentation and comparison across diverse causality-based attack and defense methods. Accompanying this implementation, we provide the first comprehensive survey of causality-driven jailbreak studies and empirically evaluate the framework on multiple open-weight models and safety-critical benchmarks including jailbreaks, hallucination detection, backdoor identification, and fairness evaluation. Our results reveal that: (1) targeted interventions on causally critical components can reliably modify safety behavior; (2) safety-related mechanisms are highly localized (i.e., concentrated in early-to-middle layers with only 1-2% of neurons exhibiting causal influence); and (3) causal features extracted from our framework achieve over 95% detection accuracy across multiple threat types. By bridging theoretical causality analysis and practical model safety, our framework establishes a reproducible foundation for research on causality-based attacks, interpretability, and robust attack detection and mitigation in LLMs. Code is available at https://github.com/Amadeuszhao/SOK_Casuality.
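
To make the neuron-level intervention idea concrete, here is a minimal sketch of an ablation experiment on an open-weight model: zero out a few MLP neurons in one transformer block and compare generations with and without the hook. This is not the paper's framework; the model name, layer index, and neuron indices are illustrative placeholders, and the module path assumes a Llama-style architecture in Hugging Face Transformers.

```python
# Hypothetical neuron-level causal intervention (activation ablation).
# MODEL_NAME, TARGET_LAYER, and TARGET_NEURONS are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # any open-weight Llama-style chat model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

TARGET_LAYER = 12                     # hypothetical early-to-middle layer
TARGET_NEURONS = [101, 2048, 4097]    # hypothetical causally critical MLP neurons

def ablate_neurons(module, args):
    # Zero the selected post-activation MLP units before the down-projection;
    # the difference in output with/without this hook estimates their causal effect.
    hidden = args[0].clone()
    hidden[..., TARGET_NEURONS] = 0.0
    return (hidden,)

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)

prompt = "Explain how password managers keep credentials safe."
baseline = generate(prompt)

# Install the intervention on the chosen block's MLP down-projection, then remove it.
handle = model.model.layers[TARGET_LAYER].mlp.down_proj.register_forward_pre_hook(ablate_neurons)
ablated = generate(prompt)
handle.remove()

print("baseline:", baseline)
print("ablated :", ablated)
```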

🔍 Key Points

  • Introduction of a unified causality analysis framework for large language models (LLMs) that supports investigation at the token, neuron, layer, and representation levels, enabling systematic manipulation and understanding of model behavior.
  • Empirical evaluations demonstrate that targeted interventions on causally critical components can reliably alter safety behavior, and that safety-critical mechanisms are highly localized, concentrated in the early-to-middle transformer layers with only 1-2% of neurons exhibiting causal influence.
  • Detectors built on causal signals achieve over 95% accuracy across threat types such as jailbreaks and hallucinations, demonstrating the framework's practical efficacy for real-world applications (a minimal detection sketch follows this list).
  • A systematic survey of existing causality-driven jailbreak methods contextualizes the framework within the broader landscape of LLM security research.
  • Recommendations for future work on identifying hidden vulnerabilities in LLMs, emphasizing research directions that leverage causal analysis for improved safety measures.
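
As referenced above, here is a minimal sketch of what the detection step could look like: a lightweight classifier over per-prompt causal feature vectors (e.g., per-layer intervention effect sizes). The synthetic data below is a stand-in for features produced by a causality analysis pass; it is not the paper's pipeline, its features, or its benchmarks, and the reported accuracy is only that of the toy example.

```python
# Hypothetical threat detection over causal feature vectors.
# The feature matrix is synthetic; a real pipeline would fill it with
# measured intervention effects for each prompt.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_per_class, n_layers = 500, 32  # hypothetical dataset size and feature length

# Stand-in features: adversarial prompts produce shifted per-layer effect sizes.
benign = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_layers))
attack = rng.normal(loc=0.8, scale=1.0, size=(n_per_class, n_layers))
X = np.vstack([benign, attack])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # 1 = jailbreak/hallucination

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out detection accuracy:", accuracy_score(y_test, detector.predict(X_test)))
```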

💡 Why This Paper Matters

This paper provides a structured and reproducible framework for analyzing and improving the security of large language models against adversarial manipulation. By focusing on causal mechanisms, it not only deepens our understanding of model vulnerabilities but also lays a foundation for developing robust defenses, making it a significant contribution to the field of AI safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable for its causality-based approach to understanding the security vulnerabilities of large language models. The framework can guide research on effective defenses against a range of adversarial attacks and inform the design of safer AI systems.

📚 Read the Full Paper