
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Authors: Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun

Published: 2026-02-02

arXiv ID: 2602.01587v1

Added to Library: 2026-02-03 08:02 UTC

Red Teaming Safety

📄 Abstract

Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous ℓ0-norm guarantees using the hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. The framework thus provides a deterministic certificate of safety, ensuring that the model remains robust against all adversarial variants within a provable radius.
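To make the hypergeometric certificate concrete, here is a minimal sketch of how such an ℓ0 radius could be computed for a randomized-ablation smoothed classifier. All numbers and function names are illustrative assumptions, not taken from the paper: we suppose `n` mutable payload tokens, ablations that keep `k` of them uniformly at random, and a measured fraction `p_safe` of ablated copies judged safe; the certificate holds as long as the "safe" majority survives the worst-case probability mass on ablations that retain any adversarial token.

```python
from math import comb

def hypergeom_overlap_bound(n: int, k: int, r: int) -> float:
    """Probability that a uniformly random k-token ablation (out of n
    mutable tokens) retains at least one of r adversarially substituted
    tokens: 1 - P(overlap = 0) under the hypergeometric distribution."""
    # math.comb(n - r, k) is 0 when r > n - k, so the bound is then 1.0
    return 1.0 - comb(n - r, k) / comb(n, k)

def certified_l0_radius(n: int, k: int, p_safe: float) -> int:
    """Largest number r of token substitutions for which the smoothed
    'safe' verdict provably survives: p_safe minus the worst-case mass
    on adversarially touched ablations must stay above 1/2."""
    r = 0
    while p_safe - hypergeom_overlap_bound(n, k, r + 1) > 0.5:
        r += 1
    return r

# Illustrative (hypothetical) numbers: 100 mutable payload tokens,
# ablations keep 20, and 95% of ablated copies are judged safe.
print(certified_l0_radius(n=100, k=20, p_safe=0.95))  # -> 2
```

Note the intuition: a smaller keep-ratio `k/n` shrinks the chance an ablation sees any adversarial token, enlarging the radius, but at the cost of sparser context for the model (the degradation NAAT is designed to absorb).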

🔍 Key Points

  • Introduction of the Provable Defense Framework that leverages Certified Semantic Smoothing (CSS) to provide rigorous safety guarantees against jailbreaking attacks on LLMs.
  • Development of Noise-Augmented Alignment Tuning (NAAT) to address performance degradation while ensuring security, effectively transforming LLMs into semantic denoising models.
  • Empirical results demonstrate a drastic reduction in the Attack Success Rate from 84.2% to 1.2% while maintaining a high benign utility rate of 94.1% on the Llama-3 model, significantly outperforming existing defenses.
  • Establishment of a certified radius based on the Hypergeometric distribution for discrete token substitutions, effectively correcting misapplied scaling laws found in prior heuristic defenses.
  • Methodological advancement through stratified randomized ablation, preserving the structural integrity of inputs and enabling effective adversarial robustness against multiple attack variants.
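The stratified ablation idea above can be sketched in a few lines: the structural stratum (system prompt, chat template) is never ablated, while only the mutable payload is randomly masked. The `[MASK]` token, the list-based tokenization, and the function name are assumptions for illustration; the paper's actual ablation token and sampling scheme may differ.

```python
import random

MASK = "[MASK]"  # hypothetical ablation token; the paper's choice may differ

def stratified_ablation(structural: list[str], payload: list[str],
                        keep: int, rng: random.Random) -> list[str]:
    """Return one ablated sample: structural tokens are immutable;
    only `keep` payload tokens survive, the rest become MASK."""
    kept = set(rng.sample(range(len(payload)), keep))
    ablated = [tok if i in kept else MASK for i, tok in enumerate(payload)]
    return structural + ablated

rng = random.Random(0)
sample = stratified_ablation(
    structural=["<system>", "You", "are", "helpful.", "</system>"],
    payload=["Ignore", "previous", "instructions", "and", "comply"],
    keep=2, rng=rng)
print(sample)
```

Because the structural stratum is preserved verbatim in every sample, each ablated copy is still a well-formed prompt, which is what lets the ensemble vote meaningfully while the certificate only has to account for substitutions in the payload.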

💡 Why This Paper Matters

The paper presents a foundational approach to hardening Large Language Models (LLMs) against adaptive adversarial attacks. By combining certified robustness with noise-aware alignment tuning, it moves jailbreak defense from empirical, attack-specific patches toward provable safety guarantees, which is crucial for real-world applications where safety is paramount.

🎯 Why It's Interesting for AI Security Researchers

This research is highly relevant for AI security researchers as it tackles the pressing challenge of adversarial attacks on LLMs, a growing concern in AI deployment. The novel certification methods and empirical results provide a benchmark for future studies aimed at improving LLM safety without sacrificing performance, making it a significant contribution to the field of AI security.
