
Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Authors: Purva Chiniya, Kevin Scaria, Sagar Chaturvedi

Published: 2026-04-06

arXiv ID: 2604.05179v1

Added to Library: 2026-04-08 02:00 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") with a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
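To make the detection stage concrete, here is a minimal sketch of dual-anchor gradient scoring on top of a Hugging Face causal LM. The anchor strings, the choice of the LM-head gradient as the feature, the cosine-based score, and the 0.5 threshold are illustrative assumptions; the paper's exact gradient features and decision rule are not reproduced here.

```python
# Minimal sketch of dual-anchor gradient scoring, assuming a Hugging Face causal LM.
# The anchors, the LM-head gradient feature, the cosine-based score, and the threshold
# are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM the guardrail wraps
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()  # disables dropout; gradients are still available for the scoring pass

def anchor_gradient(prompt: str, anchor: str) -> torch.Tensor:
    """Gradient of the LM loss for answering `prompt` with `anchor`, flattened
    over a compact parameter set (here: the LM head)."""
    enc = tokenizer(prompt + anchor, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels = enc["input_ids"].clone()
    labels[:, :prompt_len] = -100  # score only the anchor tokens (approximate masking)
    model.zero_grad()
    model(**enc, labels=labels).loss.backward()
    return model.lm_head.weight.grad.detach().flatten().clone()

def dual_anchor_score(prompt: str) -> float:
    """Higher score = more likely unsafe: compares how the prompt pulls toward
    complying ("Sure") versus refusing ("Sorry")."""
    g_accept = anchor_gradient(prompt, " Sure")
    g_refuse = anchor_gradient(prompt, " Sorry")
    # One plausible decision signal: strongly dissimilar anchor gradients mean the
    # prompt sharply separates compliance from refusal, i.e. it looks risky.
    return float(1.0 - F.cosine_similarity(g_accept, g_refuse, dim=0))

flagged = dual_anchor_score("Explain how to disable a home alarm system.") > 0.5  # illustrative threshold
```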

🔍 Key Points

  • Introduction of Gradient-Controlled Decoding (GCD) as a dual-anchor safety mechanism for LLMs, addressing limitations of previous single-anchor systems.
  • GCD reduces false positive rates by 52% compared to prior methods while maintaining comparable recall, cutting over-refusal of benign queries and improving the user experience.
  • The system provides guaranteed first-token safety by preset-injecting refusal tokens irrespective of the sampling strategy, making it resilient to a range of prompt-injection attacks (see the sketch after this list).
  • Performance evaluated across multiple benchmarks (ToxicChat, XSTest-v2, AdvBench) demonstrates robust generalizability, lower attack success rates, and manageable latency increases (15-20 ms).
  • GCD integrates seamlessly across different models (LLaMA-2, Mixtral, Qwen), requiring only 20 demonstration templates for implementation.
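To illustrate the mitigation stage described above, the sketch below preset-injects a refusal prefix into the context whenever the detector flags a prompt, so the first emitted token is a refusal no matter how the continuation is sampled. The refusal string and the sampling settings are assumptions for illustration; `model` and `tokenizer` are the same objects as in the detection sketch.

```python
# Hedged sketch of the mitigation stage: a flagged prompt gets refusal tokens
# preset-injected into the context before decoding resumes. The refusal prefix and
# sampling settings are illustrative assumptions.
import torch

REFUSAL_PREFIX = "Sorry, I can't"

def guarded_generate(model, tokenizer, prompt: str, flagged: bool,
                     max_new_tokens: int = 128) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    if flagged:
        # Append the refusal tokens to the context; decoding then continues the
        # refusal, so no sampler can place a harmful token first.
        refusal_ids = tokenizer(REFUSAL_PREFIX, add_special_tokens=False,
                                return_tensors="pt")["input_ids"]
        input_ids = torch.cat([input_ids, refusal_ids], dim=-1)
    out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                         do_sample=True, top_p=0.9)
    completion = tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    return (REFUSAL_PREFIX + completion) if flagged else completion
```

Because the refusal tokens are already part of the sequence when generation starts, the first-token guarantee holds for greedy, top-p, or temperature sampling alike.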

💡 Why This Paper Matters

This paper is crucial as it presents a new, effective approach to enhancing the safety of large language models against adversarial attacks while minimizing negative impacts on user experience. By reducing false positives, it ensures a more reliable interaction framework, making LLMs more beneficial for sensitive applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this work invaluable as it addresses current vulnerabilities in LLMs, particularly in relation to jailbreak and prompt injection attacks. The proposed GCD mechanism represents a novel approach to safety that balances robustness and user experience, which is critical in the field of AI security, where effective defenses against adversarial manipulations are paramount.

📚 Read the Full Paper