← Back to Library

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

Authors: Md. Mehedi Hasan, Ziaur Rahman, Rafid Mostafiz, Md. Abir Hossain

Published: 2025-10-26

arXiv ID: 2510.22628v1

Added to Library: 2025-10-28 04:00 UTC

Red Teaming Safety

📄 Abstract

This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically translates non-English prompts into English for semantic evaluation, enabling consistent detection across over 100 languages. The system includes a HITL feedback loop, where decisions made by the automated system are reviewed by human experts for continual learning and rapid adaptation under adversarial pressure. Sentra-Guard maintains an evolving dual-labeled knowledge base of benign and malicious prompts, enhancing detection reliability and reducing false positives. Evaluation results show a 99.96% detection rate (AUC = 1.00, F1 = 1.00) and an attack success rate (ASR) of only 0.004%. This outperforms leading baselines such as LlamaGuard-2 (1.3%) and OpenAI Moderation (3.7%). Unlike black-box approaches, Sentra-Guard is transparent, fine-tunable, and compatible with diverse LLM backends. Its modular design supports scalable deployment in both commercial and open-source environments. The system establishes a new state-of-the-art in adversarial LLM defense.

🔍 Key Points

  • Introduction of Sentra-Guard, a novel framework designed to detect and mitigate jailbreak and prompt injection attacks targeting large language models (LLMs) in real-time.
  • Utilization of a hybrid architecture combining FAISS-indexed SBERT embeddings and fine-tuned transformer classifiers, achieving a detection accuracy of 99.98% and an attack success rate of just 0.004%.
  • Integration of a language-agnostic preprocessing layer enables effective detection across over 100 languages, facilitating multilingual defenses against adversarial attacks.
  • Implementation of a human-in-the-loop (HITL) feedback system that allows for continuous learning and adaptation to emerging threats without requiring full model retraining.
  • Demonstration of superior performance compared to existing baselines such as LlamaGuard-2 and OpenAI Moderation, particularly in handling obfuscated and zero-day attack scenarios.

💡 Why This Paper Matters

The Sentra-Guard framework represents a significant advancement in the defense mechanisms available for large language models against adversarial prompts. Its exceptional detection capabilities, combined with real-time processing and multilingual support, make it a crucial tool for enhancing the security and reliability of AI systems that depend on LLMs. By incorporating human feedback into its detection process, Sentra-Guard not only achieves high precision but also ensures ongoing adaptability to new threats, thereby protecting against future vulnerabilities.

🎯 Why It's Interesting for AI Security Researchers

This paper is of great interest to AI security researchers as it addresses the pressing issue of adversarial attacks on large language models, a rapidly evolving area of research. Sentra-Guard's innovative methodologies, such as its hybrid architecture and multilingual defenses, provide a robust framework for understanding and countering vulnerabilities inherent in LLMs. Furthermore, the emphasis on real-time detection and the integration of human feedback loops are critical for developing responsive and resilient AI systems in real-world applications, making the findings highly applicable and impactful in the ongoing discourse on AI safety and security.

📚 Read the Full Paper