
ShadowLogic: Backdoors in Any Whitebox LLM

Authors: Kasimir Schulz, Amelia Kawasaki, Leo Ring

Published: 2025-11-01

arXiv ID: 2511.00664v1

Added to Library: 2025-11-14 23:05 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) are widely deployed across various applications, often with safeguards to prevent the generation of harmful or restricted content. However, these safeguards can be covertly bypassed through adversarial modifications to the computational graph of a model. This work highlights a critical security vulnerability in computational graph-based LLM formats, demonstrating that widely used deployment pipelines may be susceptible to obscured backdoors. We introduce ShadowLogic, a method for creating a backdoor in a white-box LLM by injecting an uncensoring vector into its computational graph representation. We set a trigger phrase that, when prepended to a prompt, applies the uncensoring vector and removes the model's content generation safeguards. The logic that detects the trigger phrase is embedded directly into the computational graph, and to evade detection of the backdoor we obfuscate this logic within the graph structure so that it resembles standard model functions. Our method requires minimal alterations to model parameters, making backdoored models appear benign while retaining the ability to generate uncensored responses when activated. We successfully implement ShadowLogic in Phi-3 and Llama 3.2, using ONNX to manipulate the computational graphs. Implanting the uncensoring vector achieved a >60% attack success rate on subsequent malicious queries.
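As a rough illustration of the graph-level edit the abstract describes, the sketch below uses the onnx Python package to splice a constant steering vector into one hidden-state tensor by inserting an Add node. It is a minimal sketch, not the authors' implementation: the tensor name, hidden size, and file paths are hypothetical, the vector here is a zero placeholder, and the trigger-detection and obfuscation subgraphs described in the paper are omitted (the injection shown is unconditional).

```python
# Minimal sketch (not the authors' code): unconditionally splice a constant
# "uncensoring"/steering vector into one hidden-state tensor of an ONNX graph.
# Tensor name, hidden size, and file paths are hypothetical placeholders.
import numpy as np
import onnx
from onnx import helper, numpy_helper

model = onnx.load("model.onnx")                          # exported white-box LLM

hidden_size = 3072                                       # assumed hidden dimension
steer = np.zeros((1, 1, hidden_size), dtype=np.float32)  # placeholder vector
model.graph.initializer.append(numpy_helper.from_array(steer, name="steer_vec"))

target = "model.layers.16.residual_out"                  # hypothetical tensor name

# Locate the node that produces `target`; the new Add node must be inserted
# right after it so the graph stays topologically sorted.
idx = next(i for i, n in enumerate(model.graph.node) if target in n.output)

# Rewire every consumer of `target` to read the steered tensor instead.
for node in model.graph.node:
    for j, name in enumerate(node.input):
        if name == target:
            node.input[j] = "steered_out"

# steered_out = target + steer_vec (broadcast over batch and sequence dims).
model.graph.node.insert(
    idx + 1,
    helper.make_node("Add", inputs=[target, "steer_vec"], outputs=["steered_out"]),
)

onnx.checker.check_model(model)
onnx.save(model, "model_backdoored.onnx")
```

In the paper's setting, the added vector would additionally be gated on a trigger-detection subgraph and disguised among ordinary graph operations; the sketch only shows how little surgery the graph format itself requires.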

🔍 Key Points

  • Introduction of ShadowLogic, a method for embedding backdoors in white-box LLMs via computational graph manipulation.
  • Demonstration that trigger phrases can be used to covertly bypass safety mechanisms with only minimal alterations to model parameters and no retraining of the model.
  • Successful implementation of ShadowLogic in Phi-3 and Llama 3.2, achieving a >60% attack success rate on subsequent malicious queries.
  • Identification of major vulnerabilities in ONNX and other computational graph formats, highlighting the ease of inserting malicious logic into deployed models.
  • Proposal of mitigation and integrity-verification measures, emphasizing the need for stronger security practices in AI model deployment (see the integrity-check sketch after this list).
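As a sketch of the kind of integrity verification the last point calls for (our assumption, not tooling from the paper), the snippet below fingerprints an ONNX file by hashing its serialized contents and diffing per-operator node counts against a trusted baseline. Injected Add/If/Where nodes would surface as count differences, although because ShadowLogic obfuscates its logic to resemble standard model functions, the full-file hash comparison is the stronger signal. File names are placeholders.

```python
# Sketch of a baseline integrity check for deployed ONNX models (an assumption,
# not the paper's tooling): hash the serialized model and compare per-operator
# node counts against a trusted copy of the same model.
import hashlib
from collections import Counter

import onnx

def graph_fingerprint(path: str):
    model = onnx.load(path)
    digest = hashlib.sha256(model.SerializeToString()).hexdigest()
    op_counts = Counter(node.op_type for node in model.graph.node)
    return digest, op_counts

baseline_hash, baseline_ops = graph_fingerprint("model_trusted.onnx")   # placeholder paths
deployed_hash, deployed_ops = graph_fingerprint("model_deployed.onnx")

if deployed_hash != baseline_hash:
    print("Graph differs from trusted baseline")
    print("Extra operators in deployed model:", deployed_ops - baseline_ops)
```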

💡 Why This Paper Matters

The paper reveals a critical vulnerability in computational graph-based LLM deployment formats that allows adversaries to embed covert backdoors into a model's graph. This demonstrates the inadequacy of current safety mechanisms and underscores a pressing need for improved safeguards. The findings advocate for more rigorous security measures in AI model deployment, making it crucial for practitioners and researchers alike to understand and counteract these vulnerabilities.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly significant for AI security researchers because it unveils a previously underexplored attack vector in machine learning model deployment: the manipulation of computational graphs. Its critique of current safety measures and its practical backdoor technique highlight the urgent need for new protective technologies, making it a vital read for those invested in safeguarding AI systems against malicious exploitation.

📚 Read the Full Paper: https://arxiv.org/abs/2511.00664v1