DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua

Published: 2025-09-21

arXiv ID: 2509.16870v1

Added to Library: 2025-09-23 04:01 UTC

Red Teaming

📄 Abstract

Intelligent software systems powered by Large Language Models (LLMs) are increasingly deployed in critical sectors, raising concerns about their safety at runtime. During an industry-academic collaboration deploying an LLM-powered virtual customer assistant, a critical software engineering challenge emerged: how can LLM-powered software systems be deployed more safely at runtime? While LlamaGuard, the current state-of-the-art runtime guardrail, offers protection against unsafe inputs, our study reveals a 24% drop in Defense Success Rate (DSR) under obfuscation- and template-based jailbreak attacks. In this paper, we propose DecipherGuard, a novel framework that integrates a deciphering layer to counter obfuscation-based prompts and a low-rank adaptation mechanism to improve guardrail effectiveness against template-based attacks. Empirical evaluation on over 22,000 prompts demonstrates that DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance (OGP) by 20% to 50% compared to LlamaGuard and two other runtime guardrails. These results highlight the effectiveness of DecipherGuard in defending LLM-powered software systems against jailbreak attacks at runtime.
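The abstract describes a two-stage runtime pipeline: deobfuscate the incoming prompt, then classify it with a LoRA-adapted guard model. The Python sketch below illustrates that flow under stated assumptions; the decoding heuristics, the `looks_like_english` check, and the `guard_model.is_unsafe` interface are illustrative stand-ins, not the paper's actual implementation.

```python
import base64
import codecs
import string

COMMON_WORDS = {"the", "a", "to", "and", "of", "is", "in", "you", "how", "for"}

def looks_like_english(text: str) -> bool:
    """Crude plausibility check: share of common English words among tokens."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return bool(tokens) and sum(t in COMMON_WORDS for t in tokens) / len(tokens) > 0.1

def try_decipher(prompt: str) -> str:
    """Heuristically undo common obfuscations (here, Base64 and ROT13)."""
    # Base64: keep the decoded text only if it is printable, plausible English.
    try:
        decoded = base64.b64decode(prompt, validate=True).decode("utf-8")
        if decoded.isprintable() and looks_like_english(decoded):
            return decoded
    except Exception:
        pass
    # ROT13: accept the rotation only if it makes the text *more* readable.
    rotated = codecs.decode(prompt, "rot13")
    if looks_like_english(rotated) and not looks_like_english(prompt):
        return rotated
    return prompt  # no obfuscation detected; guard the raw prompt

def is_blocked(prompt: str, guard_model) -> bool:
    """Run the deciphered prompt through a LoRA-adapted safety classifier.

    `guard_model.is_unsafe` is a hypothetical interface standing in for a
    LlamaGuard-style model fine-tuned with low-rank adaptation.
    """
    return guard_model.is_unsafe(try_decipher(prompt))
```

In this reading, the deciphering layer defeats obfuscation-based attacks by restoring the prompt to a form the guard model can recognize, while the low-rank adaptation targets template-based attacks the base guardrail misses.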

🔍 Key Points

  • Introduction of DecipherGuard, a framework combining deciphering and low-rank adaptation to enhance LLM safety against jailbreak prompts.
  • Empirical validation showing DecipherGuard improves Defense Success Rate (DSR) by 36% to 65% compared to existing guardrails like LlamaGuard.
  • Introduction of the Overall Guardrail Performance (OGP) metric, which accounts for both defense success and false alarms, offering a more comprehensive evaluation approach (see the metric sketch after this list).
  • Identification of significant vulnerabilities in state-of-the-art guardrails when faced with obfuscation and template-based jailbreak attacks.
  • Detailed ablation studies highlighting the individual contributions of the deciphering layer and low-rank adaptation to the overall effectiveness of DecipherGuard.
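For concreteness, here is a minimal sketch of how the two evaluation metrics could be computed. DSR follows directly from its name; this summary does not give the exact OGP formula, so the unweighted average below is an assumption, as is every identifier:

```python
def defense_success_rate(jailbreaks_blocked: list[bool]) -> float:
    """DSR: fraction of jailbreak prompts that the guardrail blocked."""
    return sum(jailbreaks_blocked) / len(jailbreaks_blocked)

def false_alarm_rate(benign_blocked: list[bool]) -> float:
    """Fraction of benign prompts that the guardrail wrongly blocked."""
    return sum(benign_blocked) / len(benign_blocked)

def overall_guardrail_performance(dsr: float, far: float) -> float:
    """Hypothetical OGP: reward blocking attacks, penalize over-blocking.

    Assumes an unweighted average; the paper's exact definition may differ.
    """
    return (dsr + (1.0 - far)) / 2.0

# Example: blocking 90% of jailbreaks while flagging 5% of benign prompts
# yields (0.90 + 0.95) / 2 = 0.925 under this sketch.
```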

💡 Why This Paper Matters

This paper presents an innovative approach to enhancing the safety of intelligent software systems powered by LLMs, particularly in safeguarding against sophisticated jailbreak attacks. By proposing DecipherGuard and a new metric for evaluating guardrail performance, the researchers provide critical insights and practical tools that can help mitigate security risks in deploying AI systems across various sensitive applications. These findings carry implications for the future of safe AI operations, making this research highly relevant and valuable.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant to AI security researchers for its focus on hardening LLM-based systems against emerging threats. The methods it introduces for detecting and mitigating jailbreak attacks represent a meaningful advance in runtime defenses, and the newly proposed OGP metric gives researchers a more effective means to evaluate and compare guardrails, supporting collaboration and improvement in AI security methodologies.

📚 Read the Full Paper: https://arxiv.org/abs/2509.16870v1