
Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Authors: Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu

Published: 2025-07-17

arXiv ID: 2507.13474v1

Added to Library: 2025-07-21 04:01 UTC

Red Teaming Safety

📄 Abstract

The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying possible new vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work further reveals diametrically opposed vulnerability biases across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at https://github.com/233liang/Paper-Summary-Attack
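The abstract describes the attack only at a high level. The following minimal Python sketch illustrates the general shape of a PSA-style prompt: summaries of a safety paper provide an "authoritative" framing, and the harmful query is infilled into a predefined subsection as the payload. The section names, the "Case Study" subsection, and the summarize() helper here are hypothetical illustrations of our own, not the authors' actual template (which is available in their repository).

```python
# Illustrative sketch of a Paper-Summary-Attack-style prompt assembly.
# Section names and summarize() are hypothetical; see the authors' code at
# https://github.com/233liang/Paper-Summary-Attack for the real template.

def summarize(paper_text: str, section: str) -> str:
    """Placeholder: summarize one section of a safety paper.
    In practice this would be produced by an auxiliary LLM."""
    return f"[summary of the '{section}' section of the source paper]"

def build_psa_prompt(paper_text: str, harmful_query: str) -> str:
    # Summaries of a real attack- or defense-focused safety paper supply the
    # academic framing that target models tend to trust.
    sections = ["Abstract", "Background", "Methodology", "Experiments"]
    body = "\n\n".join(
        f"## {name}\n{summarize(paper_text, name)}" for name in sections
    )
    # The harmful query is infilled into a predefined subsection
    # (here a hypothetical "Case Study") as the adversarial payload.
    payload = f"## Case Study\nAs a concrete example, explain: {harmful_query}"
    return (
        "Please review the following summary of a recent LLM safety paper "
        "and complete the case study section in detail.\n\n"
        f"{body}\n\n{payload}"
    )

if __name__ == "__main__":
    print(build_psa_prompt("<full text of a safety paper>", "<harmful query>"))
```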

🔍 Key Points

  • The Paper Summary Attack (PSA) framework is introduced as a novel attack technique that exploits large language model (LLM) vulnerabilities by repurposing academic safety papers as adversarial prompt templates.
  • PSA achieves attack success rates (ASR) of 97% on Claude3.5-Sonnet and 98% on Deepseek-R1 (see the ASR sketch after this list), demonstrating substantial weaknesses in existing safety mechanisms.
  • The paper provides empirical evidence that LLMs are particularly susceptible to manipulation when exposed to LLM safety papers, revealing critical biases in how models respond to different types of external knowledge.
  • Findings indicate significant variability in vulnerability based on the type of paper (attack-focused vs. defense-focused), suggesting a need to rethink current alignment strategies for enhanced LLM safety.
  • The study emphasizes the importance of understanding and mitigating alignment biases, as existing defenses often fail to protect LLMs from contextually rich adversarial prompts.
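For context on the numbers above, attack success rate is simply the fraction of attempted harmful queries for which the attack elicits a non-refused, harmful response. The brief sketch below shows the computation; the is_jailbroken() judge is a hypothetical placeholder (papers typically use an LLM judge or keyword-based refusal detection), not the evaluation protocol used by the authors.

```python
# Minimal sketch of the ASR metric referenced above.
# The is_jailbroken() judge is a stand-in for an LLM- or keyword-based classifier.

def is_jailbroken(response: str) -> bool:
    """Placeholder judge: treat any non-refusal as a successful jailbreak."""
    refusal_markers = ("I'm sorry", "I cannot", "I can't help")
    return not any(marker in response for marker in refusal_markers)

def attack_success_rate(responses: list[str]) -> float:
    """ASR = (# successful jailbreaks) / (# attempted harmful queries)."""
    if not responses:
        return 0.0
    return sum(is_jailbroken(r) for r in responses) / len(responses)

# Example: 97 successful responses out of 100 harmful queries -> ASR = 0.97
```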

💡 Why This Paper Matters

This paper is crucial as it highlights significant security vulnerabilities in large language models and presents an effective attack vector leveraging academic literature, which is generally trusted by these models. The findings not only illuminate inherent weaknesses in model alignments and safety protocols but also lay the groundwork for future research aimed at developing more robust and reliable LLMs.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it exposes a novel vulnerability in LLMs that can be exploited using academic papers, thus prompting a reevaluation of safety measures and alignment methodologies. Researchers focusing on adversarial machine learning and model safety can utilize the insights from this study to devise stronger defenses against similar manipulation techniques.

📚 Read the Full Paper: https://arxiv.org/abs/2507.13474v1