Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Authors: Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Published: 2025-07-07

arXiv ID: 2507.05248v1

Topic: Red Teaming

📄 Abstract

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and fabricated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
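
The dialogue construction described in the abstract can be sketched as follows. This is a minimal illustration assuming an OpenAI-style chat-message format; the function and parameter names are hypothetical and not taken from the paper's released code, and the attack inputs are left as abstract parameters.

```python
# Minimal sketch of the Response Attack (RA) dialogue structure, assuming an
# OpenAI-style chat-message list. All names are illustrative placeholders,
# not the paper's actual implementation.

def build_primed_dialogue(paraphrased_query: str,
                          fabricated_response: str,
                          trigger_prompt: str) -> list[dict]:
    """Assemble a chat history that contextually primes the target model.

    paraphrased_query:   paraphrase of the original query
    fabricated_response: mildly harmful reply produced by an auxiliary LLM
    trigger_prompt:      succinct follow-up appended after the fabricated turn
    """
    return [
        {"role": "user", "content": paraphrased_query},
        # The fabricated response is inserted as if the target model had
        # already produced it; this prior turn is the contextual priming step.
        {"role": "assistant", "content": fabricated_response},
        {"role": "user", "content": trigger_prompt},
    ]
```

The key design point is that the priming signal lives in the assistant turn rather than the user turn, so the target model treats the injected response as its own earlier output when generating the next reply.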

📚 Read the Full Paper: https://arxiv.org/abs/2507.05248v1