
Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

Authors: Masahiro Kaneko, Timothy Baldwin

Published: 2025-10-19

arXiv ID: 2510.17000v1

Added to Library: 2025-10-21 04:03 UTC

Red Teaming

📄 Abstract

Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed. Examples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions. The LLM reveals an *observable signal* $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits. Yet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency–risk trade-off. We fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit. Treating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\varepsilon$ requires at least $\log(1/\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy. Thus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy. Experiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen. Our results provide the first principled yardstick for balancing transparency and security when deploying LLMs.
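
The bound is easy to work with numerically. Below is a minimal sketch (an illustration, not code from the paper) that plugs assumed leak rates into $N_{\min} = \log_2(1/\varepsilon)/I(Z;T)$, with the logarithm taken base 2 so that $I(Z;T)$ is measured in bits; the per-scenario bit rates are assumptions chosen only to mirror the qualitative ordering described above, not values measured in the paper.

```python
import math

def min_queries(leak_bits_per_query: float, target_error: float) -> float:
    """Lower bound on the number of queries needed to infer the target
    property T with error at most `target_error`, when each query leaks
    `leak_bits_per_query` bits about T, i.e. I(Z;T) in bits:

        N_min = log2(1 / target_error) / I(Z;T)
    """
    if leak_bits_per_query <= 0:
        return math.inf  # nothing leaks, so no finite query count suffices
    return math.log2(1.0 / target_error) / leak_bits_per_query

# Illustrative leak rates in bits per query. These are assumptions chosen to
# mirror the qualitative ordering in the abstract (answer tokens < + logits <
# full thinking process), not measured values.
scenarios = {
    "answer tokens only": 0.01,
    "answer tokens + logits": 0.10,
    "full thinking process": 0.50,
}

eps = 1e-3  # desired attack error
for name, bits in scenarios.items():
    print(f"{name:24s} I(Z;T) ~ {bits:.2f} bits/query -> "
          f"N_min ~ {min_queries(bits, eps):7.1f} queries")
```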

🔍 Key Points

  • Introduces an information-theoretic framework that treats the mutual information $I(Z;T)$ between a model's observable signal $Z$ (answer tokens, thinking-process tokens, or logits) and the attacker's target property $T$ as the bits leaked per query.
  • The framework yields a concrete bound: achieving error $\varepsilon$ requires at least $N_{\min} = \log(1/\varepsilon)/I(Z;T)$ queries, so the query cost scales linearly with the inverse of the information leaked per query and only logarithmically with the desired accuracy (see the estimator sketch after this list).
  • Empirical validation across attack scenarios shows that exposing even slightly more information sharply reduces the queries needed: answer tokens alone require about a thousand queries, adding logits cuts this to roughly a hundred, and revealing the full thinking process trims it to a few dozen.
  • Experiments span seven LLMs (including GPT-4, OLMo, and Llama) under system-prompt leakage, jailbreak, and relearning attacks, giving a comprehensive view of the security landscape around LLM deployment.
  • The analysis offers practitioners a principled approach to balancing transparency in LLM outputs against potential security vulnerabilities.
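
To apply the bound in an audit, one first needs an estimate of the leak rate $I(Z;T)$. The sketch below uses a generic plug-in estimator over discretized (signal, target) pairs from a hypothetical audit log and then evaluates the query lower bound; it is not the paper's estimation procedure, and the sample data are invented for illustration.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """Plug-in estimate of I(Z;T) in bits from (z, t) samples, where z is a
    discretized observable signal (e.g. a bucketed logit or a response label)
    and t is the target property (e.g. a binary refusal flag). This is a
    generic empirical estimator, not the paper's exact procedure."""
    n = len(pairs)
    joint = Counter(pairs)
    z_marginal = Counter(z for z, _ in pairs)
    t_marginal = Counter(t for _, t in pairs)
    mi = 0.0
    for (z, t), count in joint.items():
        p_zt = count / n
        p_z = z_marginal[z] / n
        p_t = t_marginal[t] / n
        mi += p_zt * math.log2(p_zt / (p_z * p_t))
    return max(mi, 0.0)

# Hypothetical audit log: each entry pairs a coarse response label with
# whether the probed behaviour was actually triggered (t = 1) or not (t = 0).
samples = ([("refusal", 0)] * 40 + [("refusal", 1)] * 10 +
           [("comply", 0)] * 10 + [("comply", 1)] * 40)

i_zt = mutual_information_bits(samples)
eps = 1e-2  # desired attack error
n_min = math.log2(1.0 / eps) / i_zt if i_zt > 0 else math.inf
print(f"estimated I(Z;T) ~ {i_zt:.3f} bits/query; "
      f"N_min for error {eps:g} ~ {n_min:.1f} queries")
```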

💡 Why This Paper Matters

This paper is significant as it provides a structured, information-theoretic approach to evaluating the security risks associated with transparency in large language models. By quantifying the amount of information that can be safely revealed in model outputs without significantly increasing the risk of adversarial attacks, it serves as a critical resource for ensuring the safe deployment of LLMs in sensitive environments.
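
Read in the defender's direction, the same bound says how much may be disclosed per query: fixing an attacker's query budget and an acceptable attack error implies a maximum tolerable leak rate. The helper below is a hypothetical illustration of that inversion, assuming the base-2 form of the bound; it is not an interface defined in the paper.

```python
import math

def max_safe_leak_bits(attack_error: float, query_budget: int) -> float:
    """Invert N_min = log2(1/eps) / I(Z;T): if an attacker is limited to
    `query_budget` queries and needs error at most `attack_error`, the
    interface may leak at most this many bits per query before the attack
    becomes feasible within the budget."""
    return math.log2(1.0 / attack_error) / query_budget

# Hypothetical deployment: rate limiting caps an attacker at 1,000 queries
# and we want attacks with error <= 1e-3 to remain out of reach.
budget, eps = 1_000, 1e-3
limit = max_safe_leak_bits(eps, budget)
print(f"maximum tolerable leak rate: {limit:.4f} bits/query")
```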

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers because it shows, in quantitative terms, how transparency in AI systems can be exploited by malicious actors. The findings offer concrete guidance for developing robust defenses against adversarial attacks and for deciding how much of a model's output (answer tokens, logits, or the thinking process) can safely be exposed.

📚 Read the Full Paper

https://arxiv.org/abs/2510.17000v1