GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

Authors: Rongchang Li, Minjie Chen, Chang Hu, Han Chen, Wenpeng Xing, Meng Han

Published: 2024-09-29

arXiv ID: 2409.19521v1

Added to Library: 2025-11-11 14:17 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) like GPT-4, LLaMA, and Qwen have demonstrated remarkable success across a wide range of applications. However, these models remain inherently vulnerable to prompt injection attacks, which can bypass existing safety mechanisms, highlighting the urgent need for more robust attack detection methods and comprehensive evaluation benchmarks. To address these challenges, we introduce GenTel-Safe, a unified framework that includes a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench, which compromises 84812 prompt injection attacks, spanning 3 major categories and 28 security scenarios. To prove the effectiveness of GenTel-Shield, we evaluate it together with vanilla safety guardrails against the GenTel-Bench dataset. Empirically, GenTel-Shield can achieve state-of-the-art attack detection success rates, which reveals the critical weakness of existing safeguarding techniques against harmful prompts. For reproducibility, we have made the code and benchmarking dataset available on the project page at https://gentellab.github.io/gentel-safe.github.io/.

🔍 Key Points

Introduction of GenTel-Safe framework which includes GenTel-Shield for prompt injection attack detection and GenTel-Bench for evaluation.
GenTel-Bench comprises 84,812 prompt injection attack samples across 3 major categories and 28 security scenarios, offering extensive coverage for benchmarking defenses.
GenTel-Shield achieved state-of-the-art performance with detection success rates of 97.63% for jailbreak attacks and 96.81% for target hijacking attacks, significantly outperforming existing defense models.
The study highlights critical weaknesses in current safety techniques and presents data augmentation strategies that enhance the robustness and accuracy of prompt evaluation.
Open sourcing of GenTel-Bench dataset and GenTel-Shield model promotes reproducibility and further research into LLM safety measures.

💡 Why This Paper Matters

This research is pivotal as it addresses the pressing need for improved defenses against prompt injection attacks in large language models. The introduction of GenTel-Safe introduces systematic evaluation methods and novel detection techniques that significantly enhance LLM security, ultimately contributing to safer AI applications across sectors.

🎯 Why It's Interesting for AI Security Researchers

This paper is of significant interest to AI security researchers as it presents a comprehensive framework for addressing the vulnerabilities of large language models (LLMs) to prompt injection attacks. Researchers will find the detailed evaluation methods, robust performance metrics, and the introduction of a substantial benchmarking dataset critical for developing and testing AI safety solutions.

GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper