TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

📄 Abstract

Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.
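The sliding-window trigger described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-token risk score (here, cosine similarity between a hidden state and a hypothetical "risk direction"), the window size, the threshold, and the patience count are all invented for the example.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class WindowMonitor:
    """Flags when the windowed mean risk stays above a threshold.

    All parameters are illustrative assumptions, not values from the paper.
    """

    def __init__(self, risk_direction, window=4, threshold=0.5, patience=2):
        self.risk_direction = risk_direction  # hypothetical high-risk direction
        self.scores = deque(maxlen=window)    # most recent per-token risk scores
        self.threshold = threshold
        self.patience = patience              # consecutive risky windows required
        self.streak = 0

    def update(self, hidden_state):
        """Consume one token's hidden state; return True if the trigger fires."""
        self.scores.append(cosine(hidden_state, self.risk_direction))
        window_full = len(self.scores) == self.scores.maxlen
        mean_risk = sum(self.scores) / len(self.scores)
        if window_full and mean_risk > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # risk must persist, so any dip resets the count
        return self.streak >= self.patience


def first_trigger(hidden_states, monitor):
    """Return the index of the first token at which the monitor fires, or None."""
    for i, h in enumerate(hidden_states):
        if monitor.update(h):
            return i
    return None
```

Requiring several consecutive over-threshold windows, rather than a single risky token, is one plausible way to read "persistently exceeds a threshold"; it trades a few tokens of delay for fewer spurious triggers.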

🔍 Key Points

Introduction of TrajGuard, a framework that utilizes hidden-state trajectories for real-time jailbreak detection without the need for training or model modification.
Demonstration that hidden states in critical layers during decoding carry stronger and more stable risk signals than the input jailbreak prompts themselves, with the representations of generated tokens progressively approaching high-risk regions of the latent space.
Implementation of a two-module system: Streaming Geometric Surveillance (SGS) for risk signal monitoring, and Prompt–Answer Inference Referee (PAIR-Judge) for semantic analysis during high-risk scenarios.
Achievement of an average defense rate of 95% at a detection latency of 5.2 ms/token and a false positive rate below 1.5%, highlighting the framework's efficiency in real-time settings.
Performed extensive evaluations across various open-source language models and reported robust defense against 12 distinct jailbreak attacks.
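The two-module design listed above can be pictured as a cheap per-token monitor gating a more expensive semantic check. The sketch below is hypothetical: `guarded_decode`, the `monitor` and `judge` callables, the `(token, risk_score)` stream, and the refusal string are names invented for illustration, and the paper's actual SGS and PAIR-Judge interfaces may differ.

```python
from typing import Callable, Iterable, List, Tuple


def guarded_decode(
    prompt: str,
    steps: Iterable[Tuple[str, float]],
    monitor: Callable[[float], bool],
    judge: Callable[[str, str], bool],
    refusal: str = "[generation stopped]",
) -> Tuple[str, int]:
    """Stream tokens, calling the judge only when the monitor raises a flag.

    steps   -- (token, risk_score) pairs, standing in for a real decoder
    monitor -- cheap check run on every token (stand-in for SGS)
    judge   -- semantic check on (prompt, partial answer) (stand-in for PAIR-Judge)
    Returns the final text and the number of judge calls made.
    """
    emitted: List[str] = []
    judge_calls = 0
    for token, risk in steps:
        emitted.append(token)
        if not monitor(risk):
            continue  # common path: no semantic check needed
        judge_calls += 1
        if judge(prompt, "".join(emitted)):
            return refusal, judge_calls  # interrupt decoding
    return "".join(emitted), judge_calls
```

A usage example with toy stand-ins: `guarded_decode("q", [("a", 0.1), ("b", 0.9)], lambda r: r > 0.8, lambda p, a: False)` lets the full text through while invoking the judge once. In practice the monitor would likely be stateful (for instance a windowed trigger) rather than a plain threshold on a single score.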

💡 Why This Paper Matters

This paper proposes TrajGuard, a dynamic framework that intercepts jailbreak attempts during decoding by monitoring internal model representations. Analyzing hidden-state trajectories offers an alternative to static defenses that inspect only prompts, outputs, or single snapshots of internal state, and it does so without training or model modification, which makes it a practical step toward safer large language models in interactive settings.

🎯 Why It's Interesting for AI Security Researchers

This work is relevant to AI security researchers because it addresses the persistent threat of jailbreak attacks, which exploit vulnerabilities in large language models. A framework that operates on the internal dynamics of model behavior during generation, rather than on external input alone, marks a shift in defense strategy and opens new avenues for research on real-time AI safety mechanisms.
