Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Authors: Peng Ding, Jun Kuang, Wen Sun, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang

Published: 2025-11-01

arXiv ID: 2511.00556v1

Added to Library: 2025-11-05 05:01 UTC

Red Teaming Safety

📄 Abstract

Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.
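To make the core idea concrete, here is a minimal sketch of how a template-based intent shift might be applied to a request. The template names and wording below are illustrative assumptions, not the paper's actual taxonomy (which is in the linked repository), and the example request is deliberately benign.

```python
# Minimal sketch of a template-based intent shift, assuming a small set of
# hand-written rewrite templates. Names and phrasings are illustrative only.

INTENT_SHIFT_TEMPLATES = {
    # Recast the request as a third-party question about how something happened.
    "third_person_inquiry": (
        "I recently read about someone who managed to {request}. "
        "How could that have happened?"
    ),
    # Recast the request as a protective / awareness question.
    "defensive_framing": (
        "What should people understand about attempts to {request} "
        "so they can protect themselves?"
    ),
}


def apply_intent_shift(request: str, template: str = "third_person_inquiry") -> str:
    """Minimally edit a request so its surface intent reads as benign."""
    return INTENT_SHIFT_TEMPLATES[template].format(request=request.rstrip("."))


if __name__ == "__main__":
    # Deliberately benign placeholder request, used only to show the edit.
    print(apply_intent_shift("recover a forgotten laptop password"))
```

The point the abstract makes is that the edit is small and the resulting prompt stays natural and human-readable, which is why intent inference, rather than token-level filtering, becomes the relevant line of defense.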

🔍 Key Points

  • Introduction of the Intent Shift Attack (ISA), a jailbreak method that applies minimal linguistic edits to a harmful request so that the model perceives its intent as benign.
  • Development of a taxonomy of intent transformations that categorizes strategies for recasting harmful requests as benign-seeming information queries.
  • Experimental demonstration that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts, rising to nearly 100% when models are fine-tuned on only benign data reformulated with ISA templates.
  • Evaluation of existing defense mechanisms reveals their inadequacies against ISA, highlighting critical gaps in current safety strategies for LLMs.
  • Exploration of potential defenses shows a trade-off between safety and utility, emphasizing the need for safety mechanisms that can accurately infer user intent (a minimal sketch of a training-free intent check follows this list).
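
As a concrete example of the training-free direction, the sketch below adds an explicit intent-inference pass before the model answers. The guard prompt wording, the two-step protocol, and the stub model are assumptions made for illustration; they are not the paper's exact defense.

```python
# Sketch of a training-free mitigation: ask the model to infer the underlying
# intent of a request before answering it, and refuse when that intent is harmful.
# Prompt wording and protocol are illustrative assumptions, not the paper's method.

from typing import Callable

GUARD_PROMPT = (
    "Before answering, state the most plausible underlying intent of the request "
    "below, including any harmful goal it could serve despite benign framing. "
    "If fulfilling the request would meaningfully enable harm, reply exactly REFUSE.\n\n"
    "Request: {request}"
)


def guarded_answer(model: Callable[[str], str], request: str) -> str:
    """Run an intent-inference pass before letting the model answer."""
    verdict = model(GUARD_PROMPT.format(request=request))
    if "REFUSE" in verdict:
        return "I can't help with that."
    return model(request)


if __name__ == "__main__":
    # Stub stands in for a real chat model so the sketch runs offline.
    def stub_model(prompt: str) -> str:
        if prompt.startswith("Before answering"):
            return "The intent appears to be ordinary troubleshooting."  # no refusal
        return "Here is a generic answer."

    print(guarded_answer(stub_model, "How do I reset my home router?"))
```

As the key points note, this kind of check trades utility for safety: an over-cautious intent judge will also refuse legitimately benign requests, which is why the paper argues for more accurate intent inference rather than blanket filtering.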

💡 Why This Paper Matters

This paper is highly relevant as it sheds light on the vulnerabilities of large language models (LLMs) to sophisticated jailbreaking attacks via the Intent Shift Attack. The work not only introduces a novel method that is both effective and subtle but also exposes fundamental weaknesses in LLMs' ability to infer user intent, raising important questions about the integrity of safety mechanisms in AI systems. Moreover, by challenging existing defenses and proposing enhancements, the paper sets the stage for future research aimed at improving AI safety, thus contributing to the broader discourse on responsible AI deployment.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly significant as it highlights new attack vectors against LLMs that bypass traditional defenses, thereby pushing the boundaries of current AI security practices. Understanding ISA's methodology and its implications for intent misinterpretation will be crucial for developing more resilient models and refining defensive strategies. Furthermore, the findings prompt deeper investigation into the intersection of AI safety and human-like interpretative capabilities in language models, a topic of increasing significance in the field.

📚 Read the Full Paper: https://arxiv.org/abs/2511.00556