
MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction

Authors: Wenqi Zhang, Yulin Shen, Changyue Jiang, Jiarun Dai, Geng Hong, Xudong Pan

Published: 2026-01-19

arXiv ID: 2601.12822v1

Added to Library: 2026-01-21 04:00 UTC

📄 Abstract

Large foundation models are integrated into Computer-Use Agents (CUAs), enabling autonomous interaction with operating systems through graphical user interfaces (GUIs) to perform complex tasks. This autonomy introduces serious security risks: malicious instructions or visual prompt injections can trigger unsafe reasoning and cause harmful system-level actions. Existing defenses, such as detection-based blocking, prevent damage but often abort tasks prematurely, reducing agent utility. In this paper, we present MirrorGuard, a plug-and-play defense framework that uses simulation-based training to improve CUA security in the real world. To reduce the cost of large-scale training in real operating systems, we propose a novel neural-symbolic simulation pipeline that generates realistic, high-risk GUI interaction trajectories entirely in a text-based simulated environment, capturing unsafe reasoning patterns and potential system hazards without executing real operations. In this simulated environment, MirrorGuard learns to intercept and rectify insecure reasoning chains of CUAs before unsafe actions are produced and executed. In real-world testing, extensive evaluations across diverse benchmarks and CUA architectures show that MirrorGuard significantly mitigates security risks. For instance, on the ByteDance UI-TARS system, it reduces the unsafe rate from 66.5% to 13.0% while maintaining a marginal false refusal rate (FRR). In contrast, the state-of-the-art GuardAgent only achieves a reduction to 53.9% and suffers from a 15.4% higher FRR. Our work demonstrates that simulation-derived defenses can provide robust, real-world protection while maintaining the fundamental utility of the agent. Our code and model are publicly available at https://bmz-q-q.github.io/MirrorGuard/.
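
As a rough illustration of the plug-and-play interception step described in the abstract, the Python sketch below shows how a guard could sit between a CUA's proposed reasoning and the operating system, correcting unsafe chains instead of aborting the task. All names here (`GuardModel`, `Action`, the `agent.propose` interface, the keyword check) are hypothetical placeholders for illustration and are not taken from the MirrorGuard codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """A GUI-level action proposed by the agent (e.g. a click, keystroke, or shell command)."""
    kind: str
    target: str
    payload: Optional[str] = None


class GuardModel:
    """Stand-in for a trained guard that inspects and rewrites a CUA's reasoning chain."""

    RISKY_MARKERS = ("rm -rf", "disable firewall", "send credentials")

    def is_unsafe(self, task: str, reasoning: str, action: Action) -> bool:
        # A real guard would run a trained model here; a keyword check keeps the sketch runnable.
        text = (reasoning + " " + (action.payload or "")).lower()
        return any(marker in text for marker in self.RISKY_MARKERS)

    def rectify(self, task: str, reasoning: str) -> str:
        # A real guard would regenerate a corrected chain; here we only annotate the unsafe one.
        return reasoning + "\n[guard] The previous step is unsafe; replan with a harmless alternative."


def guarded_step(agent, guard: GuardModel, task: str, observation: str) -> Action:
    """Intercept the agent's proposed reasoning before any action reaches the operating system."""
    reasoning, action = agent.propose(task, observation)          # assumed agent interface
    if guard.is_unsafe(task, reasoning, action):
        corrected = guard.rectify(task, reasoning)
        # Re-plan from the corrected reasoning instead of aborting the task outright.
        reasoning, action = agent.propose(task, observation, prefix_reasoning=corrected)
    return action
```

The design choice this sketch mirrors is the one the abstract emphasizes: the guard corrects the reasoning and lets the task continue, rather than blocking it the way detection-based defenses do.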

🔍 Key Points

  • MirrorGuard is a plug-and-play defense framework that intercepts and rectifies insecure reasoning chains of Computer-Use Agents (CUAs) before unsafe actions are produced and executed, instead of aborting tasks the way detection-based blocking does.
  • A neural-symbolic simulation pipeline generates realistic, high-risk GUI interaction trajectories entirely in a text-based simulated environment, capturing unsafe reasoning patterns and potential system hazards without executing real operations or paying the cost of large-scale training on real operating systems (a toy simulator sketch follows this list).
  • The guard trained purely in simulation transfers to real-world settings: evaluations across diverse benchmarks and CUA architectures show substantial reductions in security risk.
  • On the ByteDance UI-TARS system, MirrorGuard reduces the unsafe rate from 66.5% to 13.0% while keeping the false refusal rate (FRR) marginal, whereas the state-of-the-art GuardAgent only reaches 53.9% and incurs a 15.4% higher FRR.
  • Code and model are publicly available at https://bmz-q-q.github.io/MirrorGuard/.
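
To make the second point concrete, a text-based simulator can apply the symbolic effect of each GUI action and label a trajectory by the hazards it would have triggered on a real machine. Everything below (the `TextGUISimulator` class, its hazard rules, and the trajectory format) is an illustrative assumption, not the paper's actual pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class SimState:
    """Symbolic snapshot of the simulated desktop; no real processes or files exist."""
    open_windows: list = field(default_factory=lambda: ["Terminal", "Browser"])
    files: dict = field(default_factory=lambda: {"~/report.docx": "draft"})
    hazards_triggered: list = field(default_factory=list)


class TextGUISimulator:
    """Applies the symbolic effect of each GUI action and records any hazard it would cause."""

    HAZARD_RULES = {
        "delete_file": "data_loss",
        "run_shell": "arbitrary_code_execution",
        "change_settings": "security_downgrade",
    }

    def __init__(self):
        self.state = SimState()

    def step(self, action_kind: str, target: str) -> str:
        hazard = self.HAZARD_RULES.get(action_kind)
        if hazard:
            self.state.hazards_triggered.append((action_kind, target, hazard))
        if action_kind == "delete_file":
            self.state.files.pop(target, None)
        # Observations are plain text rather than screenshots.
        return f"windows={self.state.open_windows} files={list(self.state.files)}"


def collect_trajectory(sim: TextGUISimulator, scripted_actions):
    """Roll out scripted (possibly prompt-injected) actions and label the result by hazards hit."""
    trace = []
    for kind, target in scripted_actions:
        observation = sim.step(kind, target)
        trace.append({"action": (kind, target), "observation": observation})
    label = "unsafe" if sim.state.hazards_triggered else "safe"
    return {"trajectory": trace, "label": label, "hazards": sim.state.hazards_triggered}


# Example: a trajectory where an injected instruction deletes a user file.
example = collect_trajectory(
    TextGUISimulator(),
    [("open_app", "Browser"), ("delete_file", "~/report.docx")],
)
print(example["label"], example["hazards"])
```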

💡 Why This Paper Matters

MirrorGuard shows that simulation-derived defenses can deliver robust, real-world protection for computer-use agents without sacrificing their utility. Existing detection-based defenses prevent damage by aborting tasks prematurely; MirrorGuard instead rectifies the agent's unsafe reasoning before harmful system-level actions are executed, sharply reducing the unsafe rate while keeping false refusals marginal. Its text-based neural-symbolic simulation pipeline also makes large-scale safety training practical by avoiding the cost and risk of executing real operations on real operating systems.

🎯 Why It's Interesting for AI Security Researchers

As CUAs gain the ability to operate GUIs autonomously, malicious instructions and visual prompt injections can escalate into harmful system-level actions, making agent-level defenses an urgent research problem. This paper offers a concrete, plug-and-play defense paradigm that corrects the agent's reasoning chain rather than simply blocking it, and demonstrates that a guard trained entirely in a text-based simulation transfers to real CUA systems. The reported gains over GuardAgent on ByteDance UI-TARS, together with the publicly available code and model, make MirrorGuard a strong baseline for research on securing GUI agents.

📚 Read the Full Paper