Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Authors: Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen

Published: 2026-03-17

arXiv ID: 2603.16192v1

Added to Library: 2026-03-18 03:00 UTC

Red Teaming

πŸ“„ Abstract

Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise how the trade-off between the extent of obfuscation and input recoverability affects jailbreak success.
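
The headline results above are Attack Success Rate (ASR) figures over judged benchmark behaviors. Purely as a point of reference, the minimal sketch below shows how an ASR metric is typically tallied from per-behavior judge verdicts; the record fields, judge labels, and example data are illustrative assumptions, not the paper's evaluation harness.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AttackResult:
    behavior_id: str      # benchmark behavior identifier (e.g., a HarmBench or JBB-Behaviors entry)
    judged_success: bool  # whether a safety judge labeled the model's response as a successful attack


def attack_success_rate(results: Iterable[AttackResult]) -> float:
    """Fraction of behaviors for which the attack elicited a judged-successful response."""
    results = list(results)
    if not results:
        return 0.0
    return sum(r.judged_success for r in results) / len(results)


# Example: 3 of 4 attempts judged successful -> ASR = 75.0%
demo = [
    AttackResult("b1", True),
    AttackResult("b2", False),
    AttackResult("b3", True),
    AttackResult("b4", True),
]
print(f"ASR = {attack_success_rate(demo):.1%}")
```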

πŸ” Key Points

  • Proposes Structured Semantic Cloaking (S2C), a novel multi-dimensional framework for jailbreak attacks on large language models (LLMs) that distributes and restructures the semantic content of a request so that its full intent is consolidated only during inference.
  • S2C combines three mechanisms: Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage, which together bypass conventional safety measures that rely on detecting coherent malicious intent.
  • Empirical evaluations show that S2C outperforms previous state-of-the-art methods, improving Attack Success Rate (ASR) by 12.4% on HarmBench and 9.7% on JBB-Behaviors and reaching up to 89.4% ASR across various models.
  • Highlights a systematic vulnerability of LLMs to attacks that exploit deeper latent representations rather than surface-level obfuscation, indicating a gap in current safety mechanisms.
  • Demonstrates the importance of understanding not only the malicious intent itself but also the timing and structure of its semantic reconstruction during generation.

πŸ’‘ Why This Paper Matters

This paper introduces a significant advance in the study of AI safety by demonstrating a novel attack strategy against current safety mechanisms in LLMs. By manipulating how malicious intent is formed and evaluated at deeper semantic levels, the authors expose vulnerabilities shared across different model architectures. The findings and techniques highlight what is required to develop defenses that withstand more sophisticated attacks, contributing to a more rigorous understanding of AI safety mechanisms.

🎯 Why It's Interesting for AI Security Researchers

This research should be of considerable interest to AI security researchers for two reasons. First, it challenges existing assumptions about the robustness of LLM safety measures, motivating a reevaluation of their capabilities against increasingly complex adversarial strategies. Second, the techniques outlined provide actionable insight for understanding, and potentially strengthening, LLM defenses against manipulation, which is critical as these models are integrated into increasingly safety-critical applications.

πŸ“š Read the Full Paper