Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Authors: Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, Guillermo Sapiro

Published: 2026-02-23

arXiv ID: 2602.19396v1

Added to Library: 2026-02-24 04:02 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train a Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validation show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.

🔍 Key Points

  • Introduction of a self-supervised framework for disentangling goal and framing signals in LLM activations, which enhances detection of jailbreak attempts.
  • Development of GoalFrameBench, a benchmark corpus for controlled evaluation of goal-framing variations in prompts, facilitating targeted defense research against jailbreaks.
  • Presentation of ReDAct, a module for Representation Disentanglement on Activations, which extracts distinct representations from LLM layers with minimal overhead.
  • Proposal of FrameShield, an anomaly detector operating on framing representations that identifies goal-preserving framing attacks with improved accuracy across several LLM families.
  • The study provides insights into the organization of semantic information in LLMs, enhancing mechanistic interpretability and LLM safety.
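To make the detection idea concrete, the sketch below illustrates one plausible instantiation of an anomaly detector over framing representations. It is not the paper's implementation: the ReDAct module and the frozen LLM are replaced by simulated framing vectors, and the scorer is a simple Mahalanobis-distance detector fit on benign data. All names (`framing_anomaly_score`, the simulated vectors, the 99th-percentile threshold) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: in the paper, framing representations come from
# the ReDAct module applied to a frozen LLM's activations. Here we simply
# simulate benign framing vectors as Gaussian samples.
benign_framing = rng.normal(loc=0.0, scale=1.0, size=(500, 16))

# Fit a Mahalanobis-distance anomaly scorer on the benign distribution
# (one simple choice of detector over framing features).
mu = benign_framing.mean(axis=0)
cov = np.cov(benign_framing, rowvar=False)
cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def framing_anomaly_score(x: np.ndarray) -> float:
    """Mahalanobis distance of a framing vector from the benign cloud."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Threshold set at a high quantile of benign scores.
benign_scores = np.array([framing_anomaly_score(v) for v in benign_framing])
threshold = np.quantile(benign_scores, 0.99)

# An off-distribution framing vector (e.g., an unusual persona or roleplay
# framing) should score well above the benign threshold.
suspicious = rng.normal(loc=4.0, scale=1.0, size=16)
print(framing_anomaly_score(suspicious) > threshold)
```

Because the score depends only on the framing representation, a detector of this shape stays agnostic to the underlying goal, which matches the paper's emphasis on catching goal-preserving framing attacks.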

💡 Why This Paper Matters

This paper advances the detection of jailbreaks in large language models by systematically addressing the framing strategies attackers use to conceal malicious goals. The GoalFrameBench corpus and the ReDAct module provide new benchmarks and tooling for LLM safety, and the results point to a clear path toward more robust defenses against sophisticated, goal-preserving attacks, with direct implications for the responsible deployment of AI systems.

🎯 Why It's Interesting for AI Security Researchers

The work is directly relevant to AI security researchers: it explains the mechanisms behind framing-based jailbreak attacks and offers countermeasures that generalize across multiple LLM architectures. Understanding how framing manipulates model behavior exposes potential vulnerabilities, while the accompanying tools and datasets can serve as benchmarks for future security testing and model hardening. These findings can guide the design of more robust models, ultimately improving the resilience of AI systems to adversarial risks.

📚 Read the Full Paper