JailWAM: Jailbreaking World Action Models in Robot Control

Authors: Hanqing Liu, Songping Wang, Jiahuan Long, Jiacheng Hou, Jialiang Sun, Chao Li, Yang Yang, Wei Peng, Xu Liu, Tingsong Jiang, Wen Yao, Yao Mu

Published: 2026-04-07

arXiv ID: 2604.05498v1

Added to Library: 2026-04-08 02:02 UTC

📄 Abstract

The World Action Model (WAM) jointly predicts future world states and actions, exhibiting stronger physical manipulation capabilities than traditional models. This powerful physical interaction ability is a double-edged sword: if safety is ignored, it directly threatens personal safety, property, and the environment. Yet existing research pays very limited attention to a critical security gap: the vulnerability of WAMs to jailbreak attacks. To fill this gap, we define a Three-Level Safety Classification Framework that systematically quantifies the safety of robotic arm motions. Building on it, we propose JailWAM, the first dedicated jailbreak attack and evaluation framework for WAMs, which consists of three core components: (1) Visual-Trajectory Mapping, which unifies heterogeneous action spaces into visual trajectory representations and enables unified evaluation across architectures; (2) a Risk Discriminator, a high-recall screening tool that optimizes the efficiency-accuracy trade-off when identifying destructive behaviors in visual trajectories; and (3) a Dual-Path Verification Strategy, which first conducts rapid coarse screening via a single-image-based video-action generation module and then performs efficient, comprehensive verification through full closed-loop physical simulation. In addition, we construct JailWAM-Bench, a benchmark for comprehensively evaluating the safety-alignment performance of WAMs under jailbreak attacks. Experiments in the RoboTwin simulation environment demonstrate that the proposed framework efficiently exposes physical vulnerabilities, achieving an 84.2% attack success rate against the state-of-the-art LingBot-VA. Moreover, robust defense mechanisms can be built on top of JailWAM, providing an effective technical path toward safe and reliable robot control systems.
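
The paper ships no code, but the Visual-Trajectory Mapping idea can be illustrated with a minimal sketch. Assuming end-effector waypoints and a calibrated pinhole camera (the function names, array shapes, and projection model below are our own assumptions, not the authors' implementation), heterogeneous actions such as joint commands or absolute poses all reduce to a 2D trajectory drawn in image space, which a single discriminator can then score regardless of the underlying model's action format:

```python
# Hedged sketch of a Visual-Trajectory Mapping step (assumed design,
# not the paper's code): project a robot arm's 3D end-effector
# waypoints into a camera image so that heterogeneous action spaces
# collapse into one visual representation.
import numpy as np

def project_trajectory(waypoints_3d: np.ndarray,
                       K: np.ndarray,
                       T_world_to_cam: np.ndarray) -> np.ndarray:
    """Map (N, 3) world-frame waypoints to (N, 2) pixel coordinates."""
    n = waypoints_3d.shape[0]
    homo = np.hstack([waypoints_3d, np.ones((n, 1))])   # (N, 4) homogeneous
    cam = (T_world_to_cam @ homo.T).T[:, :3]            # world -> camera frame
    pix = (K @ cam.T).T                                 # pinhole projection
    return pix[:, :2] / pix[:, 2:3]                     # perspective divide

def rasterize(pixels: np.ndarray, h: int = 224, w: int = 224) -> np.ndarray:
    """Draw the projected trajectory as a binary mask (crude polyline)."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for (x0, y0), (x1, y1) in zip(pixels[:-1], pixels[1:]):
        for t in np.linspace(0.0, 1.0, 50):             # sample each segment
            x, y = int(x0 + t * (x1 - x0)), int(y0 + t * (y1 - y0))
            if 0 <= x < w and 0 <= y < h:
                mask[y, x] = 1
    return mask
```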

🔍 Key Points

  • Definition of a Three-Level Safety Classification Framework that systematically quantifies the safety of robotic arm motions, giving jailbreak evaluation a concrete physical-harm metric.
  • Introduction of JailWAM, the first dedicated jailbreak attack and evaluation framework for World Action Models (WAMs), built from three components: Visual-Trajectory Mapping, a Risk Discriminator, and a Dual-Path Verification Strategy.
  • Visual-Trajectory Mapping unifies heterogeneous action spaces into visual trajectory representations, enabling unified attack evaluation across model architectures.
  • The Risk Discriminator acts as a high-recall screen for destructive behaviors, and the Dual-Path Verification Strategy pairs rapid coarse screening (a single-image-based video-action generation module) with full closed-loop physical simulation; a minimal sketch of this two-stage flow follows this list.
  • Construction of JailWAM-Bench and empirical evaluation in the RoboTwin simulation environment, achieving an 84.2% attack success rate against the state-of-the-art LingBot-VA while also demonstrating that robust defenses can be built on the same framework.

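The two-stage verification can be made concrete with a short, hedged sketch reconstructed from the abstract alone; `coarse_rollout`, `risk_score`, and `simulate` are hypothetical stand-ins for the paper's single-image generation module, Risk Discriminator, and closed-loop simulator, and the threshold value is illustrative:

```python
# Hedged sketch of the Dual-Path Verification flow (our reconstruction,
# not the authors' code). Cheap, high-recall screening first; expensive
# physical simulation only for surviving candidates.
from typing import Callable, List

def dual_path_verify(prompts: List[str],
                     coarse_rollout: Callable[[str], object],
                     risk_score: Callable[[object], float],
                     simulate: Callable[[str], bool],
                     recall_threshold: float = 0.2) -> List[str]:
    """Return the prompts confirmed to induce destructive behavior.

    Path 1: single-image video-action generation, screened by a
    high-recall discriminator (low threshold -> few misses, at the
    cost of false alarms). Path 2: full closed-loop simulation, run
    only on candidates that survive the screen.
    """
    confirmed = []
    for prompt in prompts:
        trajectory_vis = coarse_rollout(prompt)          # fast, approximate
        if risk_score(trajectory_vis) < recall_threshold:
            continue                                     # screened out early
        if simulate(prompt):                             # expensive but exact
            confirmed.append(prompt)
    return confirmed
```

The design choice mirrors the efficiency-accuracy trade-off the abstract describes: a deliberately permissive screen keeps the expensive simulator off the common case while ensuring genuinely destructive trajectories are rarely missed.
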
💡 Why This Paper Matters

This paper spotlights a critical and previously overlooked vulnerability: World Action Models, whose strong physical manipulation capabilities make them attractive for robot control, can be jailbroken into executing motions that endanger people, property, and the environment. JailWAM is the first framework to attack and evaluate WAMs systematically, and the accompanying Three-Level Safety Classification Framework and JailWAM-Bench give the community concrete tools for measuring safety alignment in the physical domain. Because robust defense mechanisms can be constructed on top of JailWAM, the work is not only a warning but also a practical path toward safer, more reliable robot control systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting because it transfers the jailbreak threat model from language and vision systems into the physical domain, where failures cause real-world harm rather than merely harmful text or imagery. Visual-Trajectory Mapping is a notable methodological contribution: by reducing heterogeneous action spaces to a common visual representation, it enables cross-architecture attack evaluation, while the high-recall Risk Discriminator and Dual-Path Verification Strategy show how large-scale red-teaming of embodied models can remain computationally tractable. Together with the 84.2% attack success rate against LingBot-VA in RoboTwin and the demonstrated defenses built on the framework, JailWAM and JailWAM-Bench offer a useful foundation for research on the adversarial robustness of embodied AI.

📚 Read the Full Paper