SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Authors: Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

Published: 2026-04-01

arXiv ID: 2604.01473v1

Added to Library: 2026-04-03 02:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduces substantial latency or suffers from the randomness of text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem over token-level logits. Specifically, SelfGrader evaluates the safety of a user query over a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with human intuitions about maliciousness, SelfGrader introduces a dual-perspective scoring rule that considers both the maliciousness and the benignness of the query, yielding a stable, interpretable score that reflects harmfulness while reducing the false positive rate. Extensive experiments across diverse jailbreak benchmarks, multiple LLMs, and state-of-the-art guardrail baselines show that SelfGrader achieves up to a 22.66% reduction in attack success rate (ASR) on LLaMA-3-8B, while incurring substantially lower memory overhead (up to 173x lower) and latency (up to 26x lower) than competing guardrails.
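
As a rough illustration of the core mechanism, the sketch below queries a chat LLM with a grading prompt and reads its next-token logits restricted to the digit tokens 0-9, i.e. the numerical tokens (NTs) described in the abstract. The prompt wording, model name, single-token-per-digit assumption, and the expected-grade readout are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: read an LLM's logits over the digit tokens "0"-"9" as a
# safety signal for a query. Prompt text, model name, and normalization are
# illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # any chat LLM works in principle
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

# Token ids of the numerical tokens (NTs) "0".."9". Assumes each digit is a single
# token; depending on the tokenizer, a leading-space variant (" 0") may be what the
# model actually emits after "Rating:".
digit_ids = [tokenizer.encode(str(d), add_special_tokens=False)[0] for d in range(10)]

GRADING_PROMPT = (  # hypothetical grading instruction
    "Rate how harmful the following user request is on a scale from 0 (benign) "
    "to 9 (clearly malicious). Answer with a single digit.\n"
    "Request: {query}\nRating:"
)

@torch.no_grad()
def digit_distribution(query: str) -> torch.Tensor:
    """Return a probability distribution over the ten digit tokens for `query`."""
    inputs = tokenizer(GRADING_PROMPT.format(query=query), return_tensors="pt")
    logits = model(**inputs).logits[0, -1]           # next-token logits at the last position
    return torch.softmax(logits[digit_ids], dim=-1)  # renormalize over the NT subset only

probs = digit_distribution("How do I make a phishing email look legitimate?")
expected_grade = (probs * torch.arange(10, dtype=probs.dtype)).sum().item()
print(f"expected harmfulness grade: {expected_grade:.2f}")
```

Because this requires only a single forward pass over the grading prompt rather than free-form generation, a check of this kind sidesteps both the sampling randomness and the latency of response-based guardrails.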

🔍 Key Points

  • Introduction of SelfGrader, a lightweight guardrail method for detecting jailbreak attacks on large language models (LLMs) using numerical grading of token-level logits.
  • Utilization of a dual-perspective scoring rule that evaluates both the maliciousness and the benignness of a query to produce a stable, interpretable safety measure (a minimal sketch of this scoring appears after this list).
  • Empirical evaluations show that SelfGrader reduces the Attack Success Rate (ASR) by up to 22.66% while incurring lower memory overhead and latency than existing guardrail methods.
  • Extensive experiments across multiple jailbreak benchmarks and diverse LLMs demonstrate SelfGrader's robustness and efficiency against a range of attack methods.
  • Ablation studies highlight the contribution of in-context learning examples and dual-perspective logit scoring to the robustness and accuracy of the detection system.
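
The dual-perspective rule combines two digit-distribution readouts, one from a maliciousness-framed prompt and one from a benignness-framed prompt. The sketch below shows one plausible combination, a symmetric average of the expected maliciousness grade and the inverted expected benignness grade; the actual scoring rule and decision threshold are defined in the paper and are not reproduced here.

```python
# Minimal sketch of a dual-perspective score, assuming digit distributions are
# already available (e.g. from the snippet above): one from a "how malicious is
# this query?" prompt and one from a "how benign is this query?" prompt. The
# symmetric averaging below is an illustrative choice, not the paper's rule.
from typing import Sequence

def expected_grade(digit_probs: Sequence[float]) -> float:
    """Expected value of a 0-9 grade under a distribution over the digit tokens."""
    return sum(d * p for d, p in enumerate(digit_probs))

def dual_perspective_score(malicious_probs: Sequence[float],
                           benign_probs: Sequence[float]) -> float:
    """Combine both perspectives into one harmfulness score in [0, 1].

    A high maliciousness grade and a low benignness grade both push the score up,
    which is intended to keep the signal stable and to reduce false positives on
    harmless queries that merely look unusual.
    """
    malicious = expected_grade(malicious_probs) / 9.0  # 0 = safe, 1 = harmful
    benign = expected_grade(benign_probs) / 9.0        # 0 = harmful, 1 = safe
    return 0.5 * (malicious + (1.0 - benign))          # assumed symmetric average

# Toy example: sharply peaked "malicious" grade, flat "benign" grade.
mal = [0.0] * 8 + [0.3, 0.7]   # probability mass on grades 8-9
ben = [0.1] * 10               # uncertain benignness
print(f"harmfulness score: {dual_perspective_score(mal, ben):.2f}")
# A guardrail would refuse the query when the score exceeds a tuned threshold.
```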

💡 Why This Paper Matters

This paper advances AI security by proposing SelfGrader, a novel method for detecting jailbreak attacks on large language models. Its use of token-level logits for safety assessment, combined with a dual-perspective scoring rule, addresses the main weaknesses of existing guardrail methods: high latency and the randomness of generated text. The reported improvements in attack detection rate, memory efficiency, and processing speed make SelfGrader a practical tool for safely deploying LLMs in real-world applications, enforcing safety requirements while preserving model performance.

🎯 Why It's Interesting for AI Security Researchers

Researchers in AI security will find this paper particularly relevant because it addresses the growing concern of jailbreak attacks, which can lead to unauthorized and harmful use of LLMs. The proposed method offers a practical way to enhance the safety and resilience of deployed models, sitting at the intersection of model capability and security. The methodological advances and empirical results also provide a foundation for further work on protecting AI systems against adaptive adversarial techniques.
