← Back to Library

GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

Authors: Md Jueal Mia, Joaquin Molto, Yanzhao Wu, M. Hadi Amini

Published: 2026-03-28

arXiv ID: 2603.28817v1

Added to Library: 2026-04-01 02:03 UTC

Red Teaming

📄 Abstract

Small Language Models (SLMs) are emerging as efficient and economically viable alternatives to Large Language Models (LLMs), offering competitive performance with significantly lower computational costs and latency. These advantages make SLMs suitable for resource-constrained and efficient deployment on edge devices. However, existing jailbreak defenses show limited robustness against heterogeneous attacks, largely due to an incomplete understanding of the internal representations across different layers of language models that facilitate jailbreak behaviors. In this paper, we conduct a comprehensive empirical study on 9 jailbreak attacks across 7 SLMs and 3 LLMs. Our analysis shows that SLMs remain highly vulnerable to malicious prompts that bypass safety alignment. We analyze hidden-layer activations across different layers and model architectures, revealing that different input types form distinguishable patterns in the internal representation space. Based on this observation, we propose GUARD-SLM, a lightweight token activation-based method that operates in the representation space to filter malicious prompts during inference while preserving benign ones. Our findings highlight robustness limitations across layers of language models and provide a practical direction for secure small language model deployment.

🔍 Key Points

  • SLMs are found to be significantly more vulnerable to jailbreak attacks than LLMs, with the proposed empirical evaluation demonstrating this vulnerability using nine jailbreak attack strategies across various models.
  • The introduction of GUARD-SLM represents a novel approach to defense that relies on lightweight token activation analysis in the representation space of SLMs, detecting malicious prompts with high accuracy and low computational overhead.
  • The study showcases that hidden representations at multiple layers contain discernible patterns for different input types, providing a foundation for effective activation-space adversarial prompt filtering.
  • Extensive experimental validation on multiple SLMs and LLMs highlights the robustness of GUARD-SLM, achieving near-zero jailbreak success rates for various attack categories while ensuring real-time performance.
  • The findings reveal that layered analysis of internal representations can lead to better understanding and defenses against jailbreaking, emphasizing the need for layered-based intrusion detection in language models.

💡 Why This Paper Matters

This paper is crucial as it addresses the growing concern of jailbreak attacks on small language models, a domain often overshadowed by larger models. By presenting a novel defense mechanism in GUARD-SLM that leverages token activation analysis, the research provides a practical and scalable solution to enhance the security of language models, particularly in resource-constrained environments. The insights gained through layer-wise sensitivity analysis also contribute significantly to the overall understanding of model vulnerabilities, making this work relevant not only for immediate defensive strategies but also for future research in model safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting as it sheds light on vulnerabilities specific to small language models, which are increasingly deployed in various applications. The innovative method of using internal layer activations for prompt filtering represents a significant shift towards proactive security measures in AI systems. Furthermore, as adversarial techniques like jailbreak attacks evolve, understanding and improving defenses becomes imperative, positioning this research as a valuable contribution to the field of AI safety and security.

📚 Read the Full Paper