
ShallowJail: Steering Jailbreaks against Large Language Models

Authors: Shang Liu, Hanyu Pei, Zeyan Liu

Published: 2026-02-06

arXiv ID: 2602.07107v1

Added to Library: 2026-02-10 03:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have been successful in numerous fields. Alignment is usually applied to prevent them from being used for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted but unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLM responses.

🔍 Key Points

  • Introduction of ShallowJail: a novel jailbreak attack method exploiting shallow alignment in LLMs, targeting the manipulation of initial tokens during inference.
  • Demonstrated a significantly high attack success rate (ASR), exceeding 90% on several state-of-the-art LLMs, confirming the vulnerability of LLMs during the initial output stages.
  • Presentation of a two-stage framework involving Steering Vectors Construction and Jailbreak Prompting, effectively guiding model outputs towards undesirable responses.
  • Extensive evaluation across multiple victim models, showcasing the effectiveness of ShallowJail across diverse datasets and conditions and highlighting its superiority over existing black-box and white-box methods.
  • Analysis of hyperparameter sensitivity, showing how slight adjustments in steering parameters can drastically influence the attack's success, offering a roadmap for potential optimizations.
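The two-stage framework above can be illustrated with a toy sketch. The code below is not the authors' implementation; it is a minimal, hedged illustration of the general idea: Stage 1 builds a steering vector as the difference of mean activations between compliant and refusing responses, and Stage 2 adds that vector, scaled by a coefficient, to the hidden states of only the first few token positions, mirroring the "shallow alignment" intuition that safety behavior is concentrated in the initial output tokens. All names (`build_steering_vector`, `apply_steering`, `alpha`, `k_initial`) and the difference-of-means construction are assumptions for illustration, not details from the paper.

```python
import numpy as np

def build_steering_vector(compliant_acts, refusal_acts):
    """Stage 1 (sketch): difference-of-means steering vector.

    compliant_acts / refusal_acts: (n_prompts, hidden_dim) arrays of
    hidden states collected from compliant vs. refusing responses.
    """
    return compliant_acts.mean(axis=0) - refusal_acts.mean(axis=0)

def apply_steering(hidden_states, v, alpha=1.0, k_initial=5):
    """Stage 2 (sketch): steer only the first k_initial positions.

    hidden_states: (seq_len, hidden_dim) activations at one layer.
    Only the earliest positions are shifted, reflecting the idea
    that alignment is concentrated in the initial output tokens.
    """
    steered = hidden_states.copy()
    k = min(k_initial, steered.shape[0])
    steered[:k] += alpha * v
    return steered

# Toy demo with random stand-in activations (hidden_dim=8).
rng = np.random.default_rng(0)
compliant = rng.normal(0.5, 1.0, size=(16, 8))
refusal = rng.normal(-0.5, 1.0, size=(16, 8))
v = build_steering_vector(compliant, refusal)

hidden = rng.normal(size=(10, 8))
steered = apply_steering(hidden, v, alpha=2.0, k_initial=3)
# Positions 3 onward are left untouched; only the first 3 are shifted.
```

In a real attack setting, `hidden_states` would come from a hook on a transformer layer during inference, and `alpha`/`k_initial` correspond to the steering hyperparameters whose sensitivity the paper analyzes.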

💡 Why This Paper Matters

The paper 'ShallowJail: Steering Jailbreaks against Large Language Models' is highly relevant as it uncovers critical vulnerabilities in LLM safety mechanisms, demonstrating how commonly employed alignment strategies can be circumvented. The findings stress the need for improved safeguard frameworks in the deployment of LLMs, especially as these models become increasingly integrated into sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

This paper is of significant interest to AI security researchers as it highlights a pressing vulnerability in widely used LLMs, showcasing a real-world breach of safety protocols. The development and understanding of such complex attack vectors are crucial for creating more resilient models, informing future research on AI safety, and guiding the implementation of robust defensive measures.
