Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Authors: Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao

Published: 2025-07-07

arXiv ID: 2507.05248v1

Topic: Red Teaming

📄 Abstract

Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and fabricated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.
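
The dialogue construction described in the abstract can be sketched as follows. This is a minimal illustration assuming an OpenAI-style chat-message format; the function and parameter names are hypothetical and not taken from the paper's released code, and the attack inputs are left as abstract parameters.

```python
# Minimal sketch of the Response Attack (RA) dialogue structure, assuming an
# OpenAI-style chat-message list. All names are illustrative placeholders,
# not the paper's actual implementation.

def build_primed_dialogue(paraphrased_query: str,
                          fabricated_response: str,
                          trigger_prompt: str) -> list[dict]:
    """Assemble a chat history that contextually primes the target model.

    paraphrased_query:   paraphrase of the original query
    fabricated_response: mildly harmful reply produced by an auxiliary LLM
    trigger_prompt:      succinct follow-up appended after the fabricated turn
    """
    return [
        {"role": "user", "content": paraphrased_query},
        # The fabricated response is inserted as if the target model had
        # already produced it; this prior turn is the contextual priming step.
        {"role": "assistant", "content": fabricated_response},
        {"role": "user", "content": trigger_prompt},
    ]
```

The key design point is that the priming signal lives in the assistant turn rather than the user turn, so the target model treats the injected response as its own earlier output when generating the next reply.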

📚 Read the Full Paper: https://arxiv.org/abs/2507.05248v1