← Back to Library

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Authors: Benyamin Tafreshian

Published: 2025-11-24

arXiv ID: 2511.18790v1

Added to Library: 2025-11-25 04:01 UTC

Red Teaming

πŸ“„ Abstract

Content moderation pipelines for modern large language models combine static filters, dedicated moderation services, and alignment tuned base models, yet real world deployments still exhibit dangerous failure modes. This paper presents RoguePrompt, an automated jailbreak attack that converts a disallowed user query into a self reconstructing prompt which passes provider moderation while preserving the original harmful intent. RoguePrompt partitions the instruction across two lexical streams, applies nested classical ciphers, and wraps the result in natural language directives that cause the target model to decode and execute the hidden payload. Our attack assumes only black box access to the model and to the associated moderation endpoint. We instantiate RoguePrompt against GPT 4o and evaluate it on 2 448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security relevant outcomes bypass, reconstruction, and execution the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution, substantially outperforming five automated jailbreak baselines. We further analyze the behavior of several automated and human aligned evaluators and show that dual layer lexical transformations remain effective even when detectors rely on semantic similarity or learned safety rubrics. Our results highlight systematic blind spots in current moderation practice and suggest that robust deployment will require joint reasoning about user intent, decoding workflows, and model side computation rather than surface level toxicity alone.

πŸ” Key Points

  • Introduction of RoguePrompt as an automated jailbreak pipeline for large language models (LLMs) that preserves harmful intent while bypassing existing moderation systems.
  • Utilization of dual-layer encryption (VigenΓ¨re and ROT-13 ciphers) to create self-reconstructing jailbreak prompts, allowing the original malicious requests to be executed without triggering safety mechanisms.
  • Robust performance in breaking through moderation filters, achieving 84.7% bypass, 80.2% reconstruction, and 71.5% execution success rates against GPT-4o using a set of real-world prompts previously marked as forbidden.
  • Comparison against five baseline jailbreak methods, demonstrating superior effectiveness and revealing critical blind spots in current moderation practices.
  • Proposed evaluation methodology focused on bypass, reconstruction, and execution metrics that highlight how jailbreak risks persist even in sophisticated AI systems.

πŸ’‘ Why This Paper Matters

This paper is significant as it showcases the potential vulnerabilities within LLM moderation systems, exposing how advanced techniques like RoguePrompt can effectively subvert these safeguards. The substantial success rates of the proposed attacks indicate a pressing need for developing more resilient moderation frameworks that can contend with multi-stage decoding and intent reconstruction operations. By highlighting these vulnerabilities, the research calls for a reevaluation of existing defenses in AI systems and proposes pathways to enhance their reliability against sophisticated threats.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly compelling as it delves into the intersection of adversarial attacks on language models and content moderation. The novel post-processing techniques introduced, alongside concrete quantitative results, provide insights into the current limitations of LLM defenses. Furthermore, the evaluation framework proposed here serves as a benchmark for assessing the robustness of future moderation strategies. Understanding how these attack vectors operate will be crucial for improving AI safety and developing models that better recognize and mitigate potential misuse.

πŸ“š Read the Full Paper