Jailbreaking Embodied LLMs via Action-level Manipulation

Authors: Xinyu Huang, Qiang Yang, Leming Shen, Zijing Ma, Yuanqing Zheng

Published: 2026-03-02

arXiv ID: 2603.01414v1

Added to Library: 2026-03-03 04:01 UTC

Red Teaming

📄 Abstract

Embodied Large Language Models (LLMs) enable AI agents to interact with the physical world through natural language instructions and actions. However, beyond the language-level risks inherent to LLMs themselves, embodied LLMs with real-world actuation introduce a new vulnerability: instructions that appear semantically benign may still lead to dangerous real-world consequences, revealing a fundamental misalignment between linguistic security and physical outcomes. In this paper, we introduce Blindfold, an automated attack framework that leverages the limited causal reasoning capabilities of embodied LLMs in real-world action contexts. Rather than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that appear semantically safe but could result in harmful physical effects when executed. Blindfold further conceals key malicious actions by injecting carefully crafted noise to evade detection by defense mechanisms, and it incorporates a rule-based verifier to improve the attack executability. Evaluations on both embodied AI simulators and a real-world 6DoF robotic arm show that Blindfold achieves up to 53% higher attack success rates than SOTA baselines, highlighting the urgent need to move beyond surface-level language censorship and toward consequence-aware defense mechanisms to secure embodied LLMs.

🔍 Key Points

  • Introduction of Blindfold: An automated attack framework that exploits action-level vulnerabilities of embodied LLMs, focusing on the real-world consequences of seemingly benign language instructions.
  • Adversarial Proxy Planning Strategy: Blindfold utilizes a compromised local surrogate LLM to generate action-level commands that may appear safe but lead to harmful physical effects, circumventing existing semantic-based defenses.
  • Intent Obfuscation Technique: The framework includes an intent obfuscation module that injects noise into action commands to conceal malicious intent from detection mechanisms while preserving the intended physical outcome.
  • Empirical Validation: Blindfold significantly outperforms state-of-the-art (SOTA) baselines, achieving up to 53% higher attack success rates in simulators and on a real-world 6DoF robotic arm, exposing critical weaknesses in existing embodied LLM security measures.
  • Call for New Defense Mechanisms: The findings highlight the urgent need for researchers to explore consequence-aware defense mechanisms that address vulnerabilities unique to embodied LLMs.
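The intent-obfuscation idea above can be illustrated with a toy sketch. Note this is purely a hypothetical illustration of the concept, not the paper's actual Blindfold implementation: the action names, rewrite table, and keyword-based filter are all invented here to show how an action plan with an unchanged physical effect can be reworded and padded with benign "noise" actions until a surface-level semantic filter no longer flags it.

```python
# Hypothetical illustration of action-level intent obfuscation.
# NOT the paper's implementation; all names and the blocklist are invented.

BLOCKLIST = {"heat", "burn", "knife", "stove"}  # toy surface-level filter


def keyword_filter(plan):
    """Accept a plan only if no step mentions a blocklisted word."""
    return all(not any(w in step.lower() for w in BLOCKLIST) for step in plan)


def obfuscate(plan, rewrites, noise):
    """Rewrite flagged steps into benign-sounding equivalents and
    interleave innocuous decoy actions to dilute the remaining signal."""
    out = []
    for step in plan:
        out.append(rewrites.get(step, step))  # same physical effect, new wording
        out.append(noise)                     # harmless padding action
    return out


# A plan whose surface wording would be blocked by the filter.
plan = ["pick up towel", "turn on stove", "place towel on heat source"]

# Hypothetical rewrites: the trigger words disappear, the outcome does not.
rewrites = {
    "turn on stove": "rotate the front-left dial clockwise",
    "place towel on heat source": "set towel on the front-left surface",
}

safe_looking = obfuscate(plan, rewrites, noise="wipe the counter")

assert not keyword_filter(plan)      # original wording is caught
assert keyword_filter(safe_looking)  # obfuscated plan slips through
```

The point of the sketch is the paper's core observation: a defense that censors language can be satisfied while the sequence of physical actions, and hence the real-world consequence, is unchanged, which is why the authors argue for consequence-aware rather than surface-level defenses.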

💡 Why This Paper Matters

This paper advances the understanding and mitigation of security vulnerabilities in embodied LLMs, demonstrating both the effectiveness of the Blindfold framework and the real-world harm these vulnerabilities could cause if left unmitigated. The methods introduced offer a fresh perspective on the gap between language-level understanding and action-level consequences, and outline a clear path for future research into more robust defenses.

🎯 Why It's Interesting for AI Security Researchers

The research is particularly relevant to AI security researchers as it exposes a critical and under-explored area of vulnerability in embodied AI systems. By highlighting how linguistic manipulation can lead to dangerous actions in the real world, this work encourages a re-evaluation of existing security measures and the exploration of innovative solutions tailored to address these unique risks.
