
"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Authors: Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, Yun Shen, Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He

Published: 2025-11-20

arXiv ID: 2511.16278v1

Added to Library: 2025-11-21 03:02 UTC

Red Teaming

📄 Abstract

As LLMs become more widely deployed, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limits scalability. In contrast to prior attacks, we propose Game-Theory Attack (GTA), a scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction with safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture, the "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the model's original preference for safety may shift toward maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical games such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the attack success rate (ASR). Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as DeepSeek-R1 while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the mechanism fixed while varying the background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection rates under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications, and we report longitudinal safety monitoring of popular HuggingFace LLMs.
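For context, a minimal sketch of the standard logit quantal-response form the abstract alludes to is shown below; the symbols (action a, context s, effective payoff u, precision λ) are generic notation assumed here, not taken from the paper.

```latex
% Logit quantal response (sketch, generic notation): the model's randomized
% output over candidate responses a in context s is softmax-weighted by an
% effective payoff u with precision (rationality) parameter \lambda.
P_{\lambda}(a \mid s) \;=\; \frac{\exp\bigl(\lambda\, u(a, s)\bigr)}{\sum_{a'} \exp\bigl(\lambda\, u(a', s)\bigr)}
```

Under the "template-over-safety flip" conjecture, a game-theoretic template reshapes the effective payoff u so that scenario rewards, rather than safety preferences, come to dominate the response distribution.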

🔍 Key Points

  • Introduction of the Game-Theory Attack (GTA) framework, which formalizes black-box jailbreak interactions as a finite-horizon, early-stoppable sequential stochastic game (see the sketch after this list).
  • Proposition of the 'template-over-safety flip' conjecture: game-theoretic scenario templates reshape the LLM's effective objective so that scenario payoffs, rather than safety, dominate, potentially weakening safety constraints.
  • Experimental validation demonstrating GTA's high attack success rates (ASR), exceeding 95% across multiple models and settings while maintaining efficiency compared to existing methods.
  • Inclusion of a Harmful-Words Detection Agent to evade prompt-guard models by inserting lexical perturbations, thereby enhancing GTA's effectiveness.
  • Scalability of GTA under different game-theoretic models and automatic generation of diverse templates, indicating robustness and adaptability in real-world applications.
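
The finite-horizon, early-stopping structure referenced above can be pictured with the following minimal Python sketch. This is not the authors' implementation: `query_llm`, `judge_is_jailbroken`, and `escalate_pressure` are hypothetical callables standing in for the black-box target model, a success judge, and the Attacker Agent's pressure escalation, respectively.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AttackResult:
    success: bool
    rounds_used: int
    final_prompt: str
    final_response: str


def run_finite_horizon_attack(
    goal: str,
    scenario_template: str,                           # assumed to contain a "{goal}" placeholder
    query_llm: Callable[[str], str],                  # black-box call to the target model
    judge_is_jailbroken: Callable[[str, str], bool],  # (goal, response) -> jailbreak succeeded?
    escalate_pressure: Callable[[str, str], str],     # (prompt, response) -> higher-pressure prompt
    max_rounds: int = 5,                              # finite horizon
) -> AttackResult:
    """Finite-horizon attack loop with early stopping on success (sketch only)."""
    prompt = scenario_template.format(goal=goal)
    response = ""
    for round_idx in range(1, max_rounds + 1):
        response = query_llm(prompt)
        if judge_is_jailbroken(goal, response):       # early stopping: end as soon as a round succeeds
            return AttackResult(True, round_idx, prompt, response)
        prompt = escalate_pressure(prompt, response)  # attacker agent raises the scenario's stakes
    return AttackResult(False, max_rounds, prompt, response)
```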

💡 Why This Paper Matters

The paper presents a novel approach to jailbreaking large language models using game theory, highlighting a crucial vulnerability that affects their safety and alignment. The high success rates and efficiency of the GTA framework mark a significant step toward understanding and addressing potential misuse of LLMs in real-world scenarios, underscoring the importance of continued research in this area.

🎯 Why It's Interesting for AI Security Researchers

This paper offers critical insights into the security vulnerabilities of LLMs, making it essential reading for AI security researchers. The introduction of an automatable, scalable jailbreak methodology grounded in game theory provides a framework for further investigation into LLM safety and can guide the development of more robust defenses against adversarial attacks. It also raises awareness of how LLM misuse could play out across applications, which is pivotal for future safety measures in AI deployment.

📚 Read the Full Paper