
Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

Authors: Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Pei Xiaobing, Jing Wang

Published: 2025-09-09

arXiv ID: 2509.07617v1

Added to Library: 2025-11-11 14:22 UTC

Red Teaming

📄 Abstract

Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier to execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activation-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate the superior cross-model transferability of our method, achieving a 49.6% attack success rate (ASR) across five mainstream LLMs and a 34.6% improvement over human-crafted prompts, while maintaining a 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
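
For context on how the two ingredients fit together: an EBM defines an unnormalized distribution over prompts via its energy, and MCMC (here, Metropolis-Hastings) samples from that distribution using only energy evaluations. The formulation below is a sketch of that standard coupling under assumed notation (E_θ for the trained energy model over surrogate activations h(x), q for a token-level proposal); the paper's exact parameterization may differ.

```latex
% Sketch under assumed notation (not the paper's exact formulation):
% E_\theta is the trained EBM over surrogate activations h(x);
% q is a token-level proposal distribution.
\[
  p_\theta(x) \propto \exp\bigl(-E_\theta(h(x))\bigr),
  \qquad
  \alpha(x \to x') = \min\!\left(1,\;
    \frac{\exp\bigl(-E_\theta(h(x'))\bigr)\, q(x \mid x')}
         {\exp\bigl(-E_\theta(h(x))\bigr)\, q(x' \mid x)}\right).
\]
```

Because the acceptance ratio depends only on energy differences, lower-energy (higher-quality) adversarial prompts are visited more often, without gradients from, or queries to, the victim model.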

🔍 Key Points

  • Introduction of an activation-guided prompt injection attack framework that improves the transferability of black-box attacks on LLMs.
  • Development of an Energy-based Model (EBM) that evaluates adversarial prompts based on internal activations of a surrogate model, allowing for optimized adversarial prompt generation without querying the victim model.
  • Utilization of token-level Markov Chain Monte Carlo (MCMC) sampling to generate diverse adversarial prompts while maintaining naturalness and interpretability (a minimal sketch of this sampling loop appears after this list).
  • Demonstration of superior transferability, with a 49.6% attack success rate (ASR) across five mainstream LLMs and a 36.6% ASR on unseen task scenarios.
  • Interpretability analysis corroborates that prompt effectiveness is strongly associated with specific activation patterns, enhancing understanding of prompt injection vulnerabilities.

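The following is a minimal, self-contained sketch of the kind of token-level Metropolis-Hastings loop such a framework could use. Everything here is an illustrative assumption: `energy_fn` stands in for the trained EBM over surrogate activations, the vocabulary and uniform single-token replacement are toy proposals, and the example energy is a placeholder, not the paper's objective.

```python
# Hypothetical sketch of token-level Metropolis-Hastings sampling guided by an
# energy function (standing in for an EBM over a surrogate model's activations).
# All names (energy_fn, vocab, toy_energy) are illustrative, not the paper's code.
import math
import random
from typing import Callable, List


def mh_token_sampler(
    prompt_tokens: List[str],
    energy_fn: Callable[[List[str]], float],  # lower energy = better adversarial prompt
    vocab: List[str],
    n_steps: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Gradient-free prompt optimization via token-level MCMC.

    Each step resamples one token position from the vocabulary (a symmetric
    proposal) and accepts the edit with the Metropolis probability
    min(1, exp(-(E' - E))). The lowest-energy prompt seen is returned.
    """
    rng = random.Random(seed)
    current = list(prompt_tokens)
    current_energy = energy_fn(current)
    best, best_energy = list(current), current_energy

    for _ in range(n_steps):
        # Propose: replace a uniformly chosen position with a uniformly chosen
        # token. Uniform resampling is symmetric, so the proposal ratio
        # q(x|x') / q(x'|x) cancels in the acceptance probability.
        pos = rng.randrange(len(current))
        candidate = list(current)
        candidate[pos] = rng.choice(vocab)
        candidate_energy = energy_fn(candidate)

        # Metropolis acceptance for a symmetric proposal.
        delta = candidate_energy - current_energy
        if delta <= 0 or rng.random() < math.exp(-delta):
            current, current_energy = candidate, candidate_energy
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy

    return best


if __name__ == "__main__":
    # Toy energy: count token mismatches against an (illustrative) target phrase.
    target = ["ignore", "previous", "instructions"]

    def toy_energy(tokens: List[str]) -> float:
        return float(sum(t != g for t, g in zip(tokens, target)))

    vocab = ["ignore", "previous", "instructions", "please", "the", "system"]
    init = ["please", "the", "system"]
    print(mh_token_sampler(init, toy_energy, vocab, n_steps=500))
```

In the paper's setting, `energy_fn` would be the EBM evaluated on the surrogate model's activations, and the proposal would be designed so that sampled prompts stay natural and interpretable; because the acceptance rule only needs energy values, the search requires neither gradients nor queries to the victim model.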
💡 Why This Paper Matters

This paper presents a significant advancement in the security analysis of Large Language Models (LLMs) by addressing the critical threat of direct prompt injection attacks. The proposed method improves attack success rates and robustness across various models and settings, contributing to ongoing research on LLM vulnerabilities and underscoring the need for effective security measures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly compelling as it tackles pressing challenges in the domain of LLM security. The novel approaches to prompt injection attacks, coupled with empirical results demonstrating effective transferability, offer valuable insights for strengthening adversarial resilience and inform the design of future defenses against such vulnerabilities.

📚 Read the Full Paper