
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Authors: Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary

Published: 2026-02-21

arXiv ID: 2602.18782v1

Added to Library: 2026-02-24 03:01 UTC

Safety

📄 Abstract

Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial input falls outside the learned decision boundary, and repeated fine-tuning is computationally expensive while potentially degrading model capabilities. We propose MANATEE, an inference-time defense that uses density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100% on certain datasets, while preserving model utility on benign inputs.

🔍 Key Points

  • MANATEE introduces a novel, inference-time lightweight defense mechanism for LLMs that prevents adversarial jailbreak attacks without requiring harmful training data or architectural modifications.
  • The method uses density estimation through diffusion models to project anomalous representations back towards benign regions, effectively decreasing the Attack Success Rate (ASR) for various models tested.
  • Experiments indicate that MANATEE can reduce ASR by up to 100% on certain datasets, demonstrating strong efficacy against adversarial attacks while preserving model behavior on benign inputs.
  • The approach reframes the problem of safety in LLMs by focusing on representing benign behaviors and correcting anomalies rather than classifying inputs as harmful or benign.
  • Through rigorous testing on multiple datasets, including the JailbreakBench and Mechanistic Anomaly Detection datasets, MANATEE proves robust across different models and contexts.
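The core idea, learning a density over benign hidden states and using score-guided diffusion to pull anomalous states back toward high-density regions, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple Gaussian fit to benign activations (the paper learns a score function with a diffusion model) and uses a Langevin-style update as the projection step. All names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for benign hidden states collected from safe prompts.
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))
mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-3 * np.eye(16)  # regularized covariance
cov_inv = np.linalg.inv(cov)

def score(h):
    """Score (gradient of log-density) of the Gaussian fit to benign states."""
    return -cov_inv @ (h - mu)

def project_to_benign(h, steps=200, step_size=1e-2):
    """Langevin-style diffusion nudging a state toward high-density benign regions."""
    h = h.copy()
    for _ in range(steps):
        noise = rng.normal(size=h.shape)
        h = h + step_size * score(h) + np.sqrt(2 * step_size) * noise
    return h

# A state far off the benign manifold, mimicking a jailbroken representation.
anomalous = mu + 8.0
corrected = project_to_benign(anomalous)
print(np.linalg.norm(anomalous - mu), np.linalg.norm(corrected - mu))
```

After the diffusion steps, the corrected state lies far closer to the benign mean than the anomalous input did, which is the sense in which the defense "projects" representations back to safe regions without ever training on harmful data.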

💡 Why This Paper Matters

The research presented in this paper is crucial in advancing the safety mechanisms for Large Language Models (LLMs), particularly in addressing the vulnerabilities posed by sophisticated adversarial attacks. By providing a lightweight, efficient, and adaptable defense system like MANATEE, the authors pave the way for more secure deployment of LLMs in real-world applications, ensuring that these powerful AI systems adhere to safety protocols without compromising their operational capabilities.

🎯 Why It's Interesting for AI Security Researchers

This paper will be of significant interest to AI security researchers because it tackles a pressing issue in the field: the susceptibility of LLMs to adversarial manipulations. The innovative use of diffusion models for anomaly correction offers a promising direction for future research, suggesting methodologies that could enhance the safety and reliability of AI systems. Furthermore, the findings underscore the importance of developing defenses that do not rely on constant retraining or extensive datasets, an area that is critical for the practical deployment of AI in sensitive environments.
