
GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Authors: Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia

Published: 2026-01-06

arXiv ID: 2601.03416v1

Added to Library: 2026-01-08 03:04 UTC

Red Teaming

📄 Abstract

Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. As a result, they underperform on reasoning models (models with Chain-of-Thought) compared to non-reasoning ones (models without Chain-of-Thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.

🔍 Key Points

  • Introduction of GAMBIT as a novel jailbreak framework for Multimodal Large Language Models (MLLMs), combining reasoning and gamification strategies to bypass safety filters.
  • Implementation of a three-module approach consisting of Puzzle-based Multimodal Encoding, Gamified Scene Construction, and Adaptive Search over Prompt Components to manipulate model responses systematically.
  • Extensive experiments demonstrating high Attack Success Rates (ASR; the metric is sketched after this list) across both reasoning and non-reasoning models, significantly outperforming prior jailbreak methods such as VisCRA and SI-Attack.
  • Identification of cognitive vulnerabilities within MLLMs that can be exploited by increasing task complexity and engagement through gamified elements, leading to reduced model adherence to safety mechanisms.
  • Proposed defense strategies to improve safety mechanisms in AI systems by recognizing their fragility under cognitive load.
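For readers unfamiliar with the ASR figures cited above, the sketch below shows how an Attack Success Rate is typically computed: the fraction of jailbreak attempts whose responses a safety judge labels harmful. The `AttackRecord` structure, field names, and judge labels are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal sketch of an Attack Success Rate (ASR) calculation, assuming
# each attempt has already been labeled harmful/benign by a safety judge.
# The record format below is hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AttackRecord:
    """One jailbreak attempt against a target model (hypothetical schema)."""
    model: str              # e.g. "Gemini 2.5 Flash"
    query_id: str           # identifier of the harmful query being tested
    judged_harmful: bool    # True if the judge deems the response harmful


def attack_success_rate(records: Iterable[AttackRecord]) -> float:
    """ASR = (# attempts judged harmful) / (total attempts)."""
    records = list(records)
    if not records:
        return 0.0
    successes = sum(1 for r in records if r.judged_harmful)
    return successes / len(records)


if __name__ == "__main__":
    # Toy example: 3 of 4 attempts judged harmful -> ASR = 75.00%
    demo = [
        AttackRecord("demo-model", f"q{i}", judged)
        for i, judged in enumerate([True, True, False, True])
    ]
    print(f"ASR = {attack_success_rate(demo):.2%}")
```

A reported value such as 92.13% therefore means that roughly 92 out of every 100 adversarial queries elicited a response the evaluator judged harmful; comparisons across models and baselines hinge on using the same query set and the same judging criterion.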

💡 Why This Paper Matters

The GAMBIT framework effectively highlights the vulnerabilities inherent in MLLMs, particularly under adversarial conditions involving gamification and complex task structures. Its contributions are significant as they not only advance our understanding of how models prioritize tasks under cognitive strain but also inform future defenses against such adversarial attacks, making it a critical study for ensuring safe and reliable AI.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it exposes critical weaknesses in safety alignments of MLLMs, illustrating how adversarial tactics can manipulate model reasoning to generate harmful outputs. The findings emphasize the need for more robust defense mechanisms, providing a practical basis for improving AI safety protocols in the face of evolving adversarial techniques.

📚 Read the Full Paper