← Back to Library

Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering

Authors: Yi Ji, Runzhi Li, Baolei Mao

Published: 2025-06-05

arXiv ID: 2506.06384v1

Added to Library: 2025-11-11 14:27 UTC

Red Teaming

📄 Abstract

With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when deployed actually, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.

🔍 Key Points

  • Introduction of DMPI-PMHFE, a dual-channel feature fusion framework for detecting prompt injection attacks on LLMs.
  • Integration of a pretrained DeBERTa model for semantic feature extraction alongside heuristic feature engineering for explicit structural feature detection.
  • Experimental validation showing DMPI-PMHFE outperforms existing models in accuracy, recall, and F1-score across diverse datasets.
  • Demonstrated practical effectiveness, significantly reducing attack success rates on popular LLMs like GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.
  • Ablation studies showcasing the contributions of individual modules within the detection framework.

💡 Why This Paper Matters

This paper offers a significant advancement in securing large language models against prompt injection attacks through the development of a comprehensive detection method. The proposed DMPI-PMHFE framework not only combines cutting-edge deep learning techniques with heuristic strategies but also demonstrates strong empirical results, making it a valuable tool for enhancing the security posture of applications relying on LLMs and addressing a critical vulnerability in the current landscape of AI technologies.

🎯 Why It's Interesting for AI Security Researchers

The relevance of this paper lies in its focused exploration of prompt injection attacks, a growing concern for AI security researchers. The proposed model and its accompanying findings provide a new approach to detecting and mitigating these vulnerabilities, making it essential reading for those involved in AI safety and model robustness. Furthermore, its integration of feature fusion techniques reflects a novel methodology that could inspire further research and development in AI security protocols.

📚 Read the Full Paper