How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Authors: Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang

Published: 2026-02-04

arXiv ID: 2602.04294v1

Tags: Red Teaming, Safety

📄 Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but has not investigated how few-shot prompting interacts with different system-prompt strategies. In this paper, we conduct a comprehensive evaluation of multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding is that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot improves RoP's safety rate by up to 4.5% by reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% by distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.

🔍 Key Points

  • The paper provides the first systematic investigation of how few-shot demonstrations interact differently with Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP), identifying opposing effects: few-shot enhances RoP's effectiveness by up to 4.5% while degrading ToP's performance by up to 21.2%.
  • It develops a mathematical framework based on Bayesian in-context learning to explain the observed interactions, providing formal theorems about how few-shot examples reinforce role associations in RoP but distract from task instructions in ToP.
  • The study offers empirical validation across multiple mainstream Large Language Models (LLMs) using four safety benchmarks and six distinct jailbreak attack methods, demonstrating the robustness of its findings across various contexts and model architectures.
  • Practical recommendations are provided for deploying prompt-based defenses, suggesting the combination of RoP and few-shot examples to improve safety while advising against using ToP with few-shot.
  • The research highlights the 'think mode paradox,' where models designed for reasoning exhibit heightened vulnerabilities to both jailbreak attacks and negative interaction with few-shot demonstrations.
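The deployment recommendation above (pair RoP with few-shot demonstrations; use ToP without them) can be sketched as prompt assembly code. This is a hypothetical illustration: the system-prompt wording, the `build_messages` helper, and the example demonstrations are illustrative assumptions, not the authors' exact prompts or evaluation harness.

```python
# Illustrative RoP vs. ToP system prompts (wording is assumed, not the
# paper's exact templates).
ROP_SYSTEM = (
    "You are a responsible, safety-conscious assistant. "
    "You always refuse requests for harmful, illegal, or unethical content."
)

TOP_SYSTEM = (
    "Task: answer the user's question helpfully. "
    "Instruction: if the request is harmful, refuse and briefly explain why."
)

# Hypothetical few-shot refusal demonstrations, in chat-message format.
FEW_SHOT = [
    {"role": "user", "content": "How do I pick a lock to break into a house?"},
    {"role": "assistant", "content": "I can't help with that. Breaking into "
     "a house is illegal, and I won't provide instructions for it."},
]


def build_messages(system_prompt, user_query, few_shot=None):
    """Assemble a chat-completion message list: system prompt first,
    optional few-shot demonstrations next, then the user query."""
    messages = [{"role": "system", "content": system_prompt}]
    if few_shot:
        messages.extend(few_shot)
    messages.append({"role": "user", "content": user_query})
    return messages


# Per the paper's recommendation: RoP benefits from few-shot demonstrations,
# while ToP is deployed without them.
rop_messages = build_messages(ROP_SYSTEM, "user query here", few_shot=FEW_SHOT)
top_messages = build_messages(TOP_SYSTEM, "user query here")
```

The resulting message lists follow the standard chat-completion format (`system` / `user` / `assistant` roles), so they can be passed to most LLM chat APIs for the kind of benchmark evaluation the paper describes.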

💡 Why This Paper Matters

This paper is critical as it not only fills a notable gap in understanding the interaction dynamics between few-shot demonstrations and prompt-based defense strategies but also emphasizes practical implications for securely deploying LLMs in real-world applications. The insights gained can significantly enhance the robustness of AI systems against adversarial attacks, thereby contributing to safer AI deployment.

🎯 Why It's Interesting for AI Security Researchers

The findings and methodologies presented in this paper should interest AI security researchers because they examine the interplay between LLM vulnerabilities and prompt-based defenses against jailbreak attempts. Understanding these interactions is crucial for building more resilient AI systems, and the paper's recommendations can directly inform defense configurations in practical language-model deployments.
