DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Authors: Xiaoyi Pang, Xuanyi Hao, Pengyu Liu, Qi Luo, Song Guo, Zhibo Wang

Published: 2026-03-02

arXiv ID: 2603.01574v1

Added to Library: 2026-03-03 04:00 UTC

Red Teaming

📄 Abstract

Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks like backdoor and prompt injection attacks, which secretly force LLMs to generate specific malicious sequences. Existing defensive approaches for such threats typically rely on high access rights, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world scenarios. To address these limitations, we introduce DualSentinel, a lightweight and unified defense framework that can accurately and promptly detect the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern by developing an innovative dual-check approach. It first employs a magnitude and trend-aware monitoring method to proactively and sensitively flag an entropy lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy lull pattern persists across both the original and the flipped task, proving that the LLM's output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at https://doi.org/10.5281/zenodo.18479273.
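The "magnitude and trend-aware" monitoring described above can be illustrated with a minimal sketch: compute the Shannon entropy of each next-token distribution, then flag a lull when a sliding window of entropies is both abnormally low (magnitude) and flat (trend). The window size and thresholds here are illustrative placeholders, not values from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def detect_entropy_lull(entropies, window=4, mean_thresh=0.2, range_thresh=0.1):
    """Return the index where an entropy lull begins, or None.

    A lull is a sliding window whose mean entropy is abnormally low
    (magnitude check) and whose spread is small (trend/stability check).
    Thresholds are illustrative, not taken from the paper.
    """
    for i in range(len(entropies) - window + 1):
        w = entropies[i:i + window]
        if sum(w) / window < mean_thresh and max(w) - min(w) < range_thresh:
            return i  # lull starts here
    return None

# Creative generation: per-token entropies fluctuate at moderate values.
normal = [1.8, 2.1, 1.5, 1.9, 2.3, 1.7]
# Hijacked generation: entropy collapses and stays flat on the fixed target.
attacked = [1.8, 2.0, 0.05, 0.04, 0.06, 0.05]

print(detect_entropy_lull(normal))    # None
print(detect_entropy_lull(attacked))  # 2
```

In a black-box API setting, the per-token entropies would come from the top-k log-probabilities many LLM endpoints already return, so no extra model access is needed beyond normal inference.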

🔍 Key Points

  • Introduction of DualSentinel, a lightweight framework that detects targeted attacks in Large Language Models (LLMs) through the observation of an entropy lull pattern.
  • Identification of the 'Entropy Lull' phenomenon: a sustained period of abnormally low and stable token probability entropy in the output of an LLM under attack, signaling that generation has been hijacked onto a fixed path.
  • Development of a dual-check mechanism utilizing magnitude and trend-aware monitoring alongside task-flipping verification to confirm targeted attacks while minimizing false positives.
  • Comprehensive evaluation demonstrating that DualSentinel achieves near 100% detection accuracy with negligible false positive rates and minimal performance overhead, making it practical for real-world deployment.
  • Comparison with existing defense mechanisms shows DualSentinel significantly outperforms established methods in robustness and efficiency.
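The task-flipping verification in the key points above can be sketched as follows: once the monitor flags a lull, the task is rewritten (e.g. inverted) and regenerated; the attack is confirmed only if the lull survives the flip, since a coerced model keeps emitting its fixed target sequence regardless of the task, while a benign-but-confident output changes. The lull detector, `toy_generate`, and `flip` below are all hypothetical stand-ins, not the paper's implementation.

```python
def lull(entropies, thresh=0.2, window=3):
    """Toy lull detector: True if any `window` consecutive entropies stay low.
    Threshold and window are illustrative, not from the paper."""
    return any(max(entropies[i:i + window]) < thresh
               for i in range(len(entropies) - window + 1))

def dual_check(generate, flip, prompt):
    """Confirm an attack only if the entropy lull persists on both the
    original task and its flipped variant. `generate` maps a prompt to a
    per-token entropy trace; `flip` rewrites the task. Both are
    hypothetical stand-ins for calls to a deployed black-box LLM API."""
    if not lull(generate(prompt)):
        return False  # no lull: generation looks benign, no extra cost paid
    # Secondary verification: a coerced model keeps emitting the fixed
    # target sequence even when the task is inverted, so the lull persists.
    return lull(generate(flip(prompt)))

# Toy stand-in: a backdoored "model" emits a fixed low-entropy tail whenever
# the trigger token appears, regardless of how the task is worded.
def toy_generate(prompt):
    if "TRIGGER" in prompt:
        return [1.9, 0.05, 0.04, 0.05, 0.03]  # hijacked: fixed target sequence
    if prompt.startswith("Do NOT"):
        return [1.8, 2.1, 1.6, 1.9]           # flipped task: varied output
    if "list" in prompt:
        return [0.1, 0.1, 0.1, 1.5, 1.8]      # confident but benign output
    return [1.8, 2.1, 1.6, 1.9]

flip = lambda p: p.replace("Summarize", "Do NOT summarize")

print(dual_check(toy_generate, flip, "Summarize this. TRIGGER"))  # True
print(dual_check(toy_generate, flip, "Summarize this list."))     # False
```

The second call shows why the dual check suppresses false positives: the benign list prompt trips the entropy monitor, but its lull disappears once the task is flipped, so no attack is reported.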

💡 Why This Paper Matters

The DualSentinel framework represents a significant advancement in the security of large language models, addressing a critical need to detect targeted attacks effectively. By leveraging a novel method to discern the abnormal behavior of compromised models, it provides a practical, efficient solution for real-world applications where high-stakes decisions rely on LLM outputs. Its ability to maintain high detection accuracy without substantial computational cost underscores its viability as a security tool for deployed AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers as it offers innovative solutions to emerging security vulnerabilities in generative models. As LLMs gain wider adoption across various applications, ensuring their integrity against targeted attacks becomes paramount. The insights provided in the paper, such as the entropy lull phenomenon and the proposed detection framework, equip researchers with essential methodologies to enhance the robustness of AI systems against adversarial exploitation.

📚 Read the Full Paper