
Strategic Deflection: Defending LLMs from Logit Manipulation

Authors: Yassine Rachidy, Jihad Rbaiti, Youssef Hmamouche, Faissal Sehbaoui, Amal El Fallah Seghrouchni

Published: 2025-07-29

arXiv ID: 2507.22160v1

Added to Library: 2025-07-31 04:01 UTC

Red Teaming

📄 Abstract

With the growing adoption of Large Language Models (LLMs) in critical areas, ensuring their security against jailbreaking attacks is paramount. While traditional defenses primarily rely on refusing malicious prompts, recent logit-level attacks have demonstrated the ability to bypass these safeguards by directly manipulating the token-selection process during generation. We introduce Strategic Deflection (SDeflection), a defense that redefines the LLM's response to such advanced attacks. Instead of outright refusal, the model produces an answer that is semantically adjacent to the user's request yet strips away the harmful intent, thereby neutralizing the attack. Our experiments demonstrate that SDeflection significantly lowers the Attack Success Rate (ASR) while maintaining model performance on benign queries. This work presents a critical shift in defensive strategies, moving from simple refusal to strategic content redirection to neutralize advanced threats.
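
The abstract's headline metric is the Attack Success Rate: the fraction of adversarial prompts whose responses a judge deems harmful. As a rough, hypothetical illustration only (the `is_harmful` judge below is a placeholder, not the paper's evaluator), ASR over a batch of attack attempts could be computed as:

```python
def attack_success_rate(responses, is_harmful):
    """Fraction of attack responses that a judge flags as harmful.

    `is_harmful` stands in for whatever classifier or human review an
    evaluation actually uses; it takes a response string and returns bool.
    """
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_harmful(r)) / len(responses)
```

A defense that lowers this number while leaving answers to benign queries intact is the success criterion the paper reports against.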

🔍 Key Points

  • Introduction of Strategic Deflection (SDeflection) to enhance the security of LLMs against logit manipulation attacks, moving beyond traditional refusal-based defense strategies.
  • Demonstration of SDeflection's effectiveness by significantly reducing the Attack Success Rate (ASR) on various models while maintaining performance on benign queries.
  • Empirical results showing SDeflection's superior performance over existing defense methods, such as Deep Alignment, against sophisticated adversarial attacks like LogitsTrap.
  • Utilization of Contrastive Preference Optimization (CPO) for training, achieving lower ASR with reduced computational complexity compared to Direct Preference Optimization (DPO); a hedged sketch of a CPO-style loss follows this list.
  • Acknowledgment of ethical considerations and potential dual-use concerns related to research on adversarial attacks against LLMs.
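
As a reference point for the CPO bullet above, here is a minimal, hypothetical sketch of a CPO-style preference loss in PyTorch, assuming the standard CPO formulation (a DPO-like contrastive term without a frozen reference model, plus an NLL term on the preferred response). The variable names and the pairing of deflecting vs. harmful responses are illustrative assumptions, not the paper's exact training recipe:

```python
import torch
import torch.nn.functional as F

def cpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             beta: float = 0.1,
             nll_weight: float = 1.0) -> torch.Tensor:
    """Sketch of a CPO-style loss for preference pairs.

    policy_chosen_logps:   summed log-probs the policy assigns to the
                           deflecting ("chosen") responses, shape (batch,)
    policy_rejected_logps: summed log-probs of the harmful ("rejected")
                           completions, shape (batch,)
    """
    # Contrastive preference term: push the policy to rank deflecting
    # answers above harmful ones. Unlike DPO, no frozen reference model
    # appears in the margin.
    margin = beta * (policy_chosen_logps - policy_rejected_logps)
    preference_loss = -F.logsigmoid(margin).mean()

    # NLL (behavior-cloning) term on the chosen responses keeps the
    # deflecting generations fluent.
    nll_loss = -policy_chosen_logps.mean()

    return preference_loss + nll_weight * nll_loss
```

Dropping the reference model is what gives CPO its lower memory and compute footprint relative to DPO, which is consistent with the reduced computational complexity noted above.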

💡 Why This Paper Matters

This paper is significant because it addresses a critical vulnerability in Large Language Models (LLMs) that traditional safety measures fail to mitigate. The development and validation of the SDeflection defense mechanism represent a novel step in strengthening AI systems against increasingly sophisticated adversarial attacks, making it highly relevant to the field of AI ethics and safety.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper would pique the interest of AI security researchers due to the growing threats posed by logit manipulation and the challenges they present to LLM safety. The introduction of SDeflection provides a promising new methodology for enhancing model robustness, which is crucial for developing secure AI applications in sensitive environments. Additionally, the paper contributes to the ongoing discourse on AI safety and ethics, making it valuable for researchers focused on mitigating adversarial risks.

📚 Read the Full Paper: https://arxiv.org/abs/2507.22160