Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Authors: Tung-Ling Li, Hongliang Liu

Published: 2025-06-30

arXiv ID: 2506.24056v1

Added to Library: 2025-07-01 04:00 UTC

Red Teaming

📄 Abstract

We introduce logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap of RLHF-aligned language models as a single pass over the vocabulary. A forward-computable score blends gap reduction with lightweight proxies for KL penalty and reward shift, allowing a "sort-sum-stop" sweep to complete in under a second and return a short suffix, using two orders of magnitude fewer model calls than beam or gradient attacks. The same suffix generalises to unseen prompts and scales from 0.5 B to 70 B checkpoints, lifting one-shot attack success from baseline levels to 80-100% while preserving topical coherence. Beyond efficiency, these suffixes expose sentence-boundary reward cliffs and other alignment artefacts, offering a lightweight probe into how safety tuning reshapes internal representations.
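The "sort-sum-stop" sweep described in the abstract can be pictured with a short sketch. The following is an illustrative reconstruction, not the paper's code: the score weights (`lambda_kl`, `lambda_reward`), the proxy tensors, and the per-token gap-reduction estimates are all assumed inputs. Only the overall pattern follows the abstract: score every vocabulary token from one forward pass, sort by score, accumulate tokens until the refusal-affirmation gap is covered, then stop.

```python
import torch

def sort_sum_stop(gap, gap_reduction, kl_proxy, reward_proxy,
                  lambda_kl=0.1, lambda_reward=0.1, max_suffix_len=8):
    """Single-pass 'sort-sum-stop' suffix search (illustrative sketch only).

    gap:           refusal-affirmation logit gap to close (scalar > 0)
    gap_reduction: [vocab] estimated per-token reduction of that gap
    kl_proxy:      [vocab] lightweight proxy for each token's KL penalty
    reward_proxy:  [vocab] lightweight proxy for each token's reward shift
    The weights and proxy definitions are assumptions, not the paper's exact ones.
    """
    # Blend gap reduction with the two proxies into one forward-computable score.
    score = gap_reduction - lambda_kl * kl_proxy + lambda_reward * reward_proxy

    # Sort: rank every vocabulary token by the blended score, best first.
    order = torch.argsort(score, descending=True)

    suffix, closed = [], 0.0
    for tok in order:
        suffix.append(int(tok))
        closed += float(gap_reduction[tok])
        # Sum, then stop: quit once the accumulated reduction covers the gap
        # or the suffix reaches its length budget.
        if closed >= gap or len(suffix) >= max_suffix_len:
            break
    return suffix
```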

🔍 Key Points

  • Introduction of logit-gap steering as an efficient jailbreak strategy for RLHF-aligned large language models (LLMs), achieving a two-orders-of-magnitude reduction in model calls compared to beam-search and gradient-based attacks.
  • Development of a gap-closing score that folds gap reduction, a KL-penalty proxy, and a reward-shift proxy into a single forward-computable quantity, allowing rapid evaluation of suffix candidates.
  • Use of a greedy covering-based suffix selection algorithm to find short, impactful suffixes that push model responses from refusal to affirmation across many prompts (see the sketch after this list).
  • Large gains in one-shot attack success across model scales from 0.5 billion to 70 billion parameters, reaching 80-100% success while maintaining topical coherence.
  • The discovered suffixes serve as lightweight probes for understanding alignment artifacts, revealing insights into how safety tuning modifies language models' internal mechanisms.
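To make the covering-based key point above concrete, here is a hypothetical sketch of a greedy covering selection: candidate suffix fragments are evaluated against a small set of refusal prompts, and at each step the fragment that closes the gap for the most still-uncovered prompts is appended. The function names, the `closes_gap` predicate, and the length budget are illustrative assumptions, not the paper's published algorithm.

```python
def greedy_cover(candidates, prompts, closes_gap, max_suffix_tokens=8):
    """Greedy covering suffix selection (hypothetical sketch of the key-point idea).

    candidates:  list of short token sequences (suffix fragments) to choose from
    prompts:     list of refusal prompts the suffix should generalise to
    closes_gap:  callable(fragment, prompt) -> bool, True if appending the
                 fragment flips that prompt's refusal-affirmation gap
    """
    suffix, uncovered = [], set(range(len(prompts)))
    while uncovered and len(suffix) < max_suffix_tokens:
        # Pick the fragment that covers the most still-uncovered prompts.
        best, best_covered = None, set()
        for frag in candidates:
            covered = {i for i in uncovered if closes_gap(frag, prompts[i])}
            if len(covered) > len(best_covered):
                best, best_covered = frag, covered
        if best is None or not best_covered:
            break  # no remaining fragment helps any uncovered prompt
        suffix.extend(best)
        uncovered -= best_covered
    return suffix
```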

💡 Why This Paper Matters

By providing a fast, inexpensive methodology for steering aligned models from refusal to affirmation, logit-gap steering marks a significant advance in AI safety and security research. Its high one-shot attack success rates, achieved while preserving topical coherence, make the technique both a practical threat to deployed RLHF-aligned models and a lightweight probe into how safety tuning reshapes their internal representations.

🎯 Why It's Interesting for AI Security Researchers

This research matters to AI security researchers because it exposes vulnerabilities in RLHF-aligned models and provides a clear, efficient methodology for exploiting them. Understanding the mechanism behind these vulnerabilities, in particular the refusal-affirmation logit gap and the alignment artefacts the suffixes reveal, supports the design of stronger safety mechanisms and more robust defenses against adversarial prompting.

📚 Read the Full Paper: https://arxiv.org/abs/2506.24056v1