
Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

Authors: Jinhwa Kim, Ian G. Harris

Published: 2025-08-09

arXiv ID: 2508.10031v1

Added to Library: 2025-08-15 04:01 UTC

📄 Abstract

While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering model, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 88% while maintaining the original LLMs' performance, achieving state-of-the-art Safety and Helpfulness Product results. Notably, our model is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves. We will make our model publicly available for research purposes.
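
The abstract describes Context Filtering as a plug-and-play input pre-processor that sits in front of any LLM, white-box or black-box, with no fine-tuning of the protected model. Below is a minimal Python sketch of how such a wrapper could be wired up; the names and interfaces (FilteredPrompt, FilterModel, safe_generate) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a plug-and-play context-filtering pre-processor.
# All names and interfaces here are illustrative assumptions; the paper's
# actual filter model and APIs may differ.

from dataclasses import dataclass
from typing import Callable

@dataclass
class FilteredPrompt:
    primary_prompt: str    # the part of the input carrying the real user intent
    dropped_context: str   # surrounding context judged untrustworthy or unreliable

# Type aliases for the two pluggable pieces: the filter model and the protected LLM.
FilterModel = Callable[[str], FilteredPrompt]
Generator = Callable[[str], str]

def safe_generate(user_input: str, filter_model: FilterModel, llm_generate: Generator) -> str:
    """Pre-process the raw input, then forward only the primary prompt to the LLM.

    Because the filtering happens entirely before generation, the wrapped LLM can
    be white-box or black-box; no fine-tuning of the model itself is required.
    """
    filtered = filter_model(user_input)
    return llm_generate(filtered.primary_prompt)
```

In this framing, the safety of the pipeline rests on the quality of the filter itself, which in the paper is a dedicated filtering model rather than a keyword heuristic.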

🔍 Key Points

  • Proposes a new defense called the Context Filtering model, an input pre-processing method that filters out untrustworthy and unreliable context and isolates the primary prompt carrying the real user intent, exposing concealed malicious intent.
  • Explicitly targets the safety-helpfulness trade-off: the goal is to improve LLM safety without degrading the experience of benign users or the models' original performance.
  • Evaluates the defense against six different jailbreak attacks, comparing it with state-of-the-art defense mechanisms and measuring the helpfulness of LLMs under each defense.
  • Reduces jailbreak Attack Success Rates by up to 88% while maintaining original performance, achieving state-of-the-art Safety and Helpfulness Product results (a metric sketch follows this list).
  • Operates as a plug-and-play component applicable to both white-box and black-box LLMs, requiring no fine-tuning of the models themselves.
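
The last two points cite the Safety and Helpfulness Product metric. The abstract does not spell out its exact definition; a natural reading, assumed in the sketch below, is the product of a safety score (taken here as 1 - Attack Success Rate) and a helpfulness score, both normalized to [0, 1].

```python
# Hedged sketch of a Safety-and-Helpfulness-Product style metric.
# The paper's exact definition may differ; this assumes both scores are
# normalized to [0, 1] and that safety is taken as 1 - ASR.

def safety_helpfulness_product(attack_success_rate: float, helpfulness: float) -> float:
    """Combine safety (1 - ASR) and helpfulness into a single score."""
    safety = 1.0 - attack_success_rate
    return safety * helpfulness

# Example: a defense that cuts ASR to 0.05 while keeping helpfulness at 0.90
print(safety_helpfulness_product(0.05, 0.90))  # 0.855
```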

💡 Why This Paper Matters

This paper matters because it addresses the well-known trade-off between safety and helpfulness from the input side: rather than fine-tuning or constraining the model itself, the Context Filtering model pre-processes prompts to strip the adversarial context that jailbreak attacks use to conceal malicious intent. Since the approach is plug-and-play for both white-box and black-box LLMs, it offers a practical way to harden deployed systems against jailbreaks while preserving their original performance for benign users.

🎯 Why It's Interesting for AI Security Researchers

This paper is of direct interest to AI security researchers because it frames jailbreaks as a problem of untrustworthy context rather than of model weights. The defense is evaluated against six different attacks and compared with state-of-the-art mechanisms, reducing Attack Success Rates by up to 88% while achieving state-of-the-art Safety and Helpfulness Product results. As an input-filtering method that requires no fine-tuning, it provides a reusable building block for layered defenses around both open and closed models.

📚 Read the Full Paper