
Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Authors: Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

Published: 2026-02-08

arXiv ID: 2602.08062v1

Added to Library: 2026-02-10 05:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap-aggregated, mixture-of-experts-inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming the OpenAI Moderation API and ShieldGemma, which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show that ensembles of small fine-tuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.

🔍 Key Points

  • Introduction of BAGEL (Bootstrap AGgregated Ensemble Layer), a modular framework for detecting malicious prompts in LLMs that balances efficiency, performance, and adaptability.
  • Utilizes an ensemble of fine-tuned prompt classifiers, each specializing in a different attack type, enabling effective detection within a modest 430M-parameter footprint.
  • Incorporates dynamic routing via a random forest classifier to select the most suitable ensemble member for a given prompt, enhancing detection accuracy.
  • Achieves a high F1 score of 0.92, outperforming existing methods such as the OpenAI Moderation API and ShieldGemma, which require orders of magnitude more parameters.
  • Demonstrates adaptability through incremental model updates without necessitating full retraining, streamlining the process of incorporating new attack data.
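The routing, stochastic member selection, majority-vote aggregation, and incremental-update steps described above can be sketched in miniature. This is a toy illustration, not the paper's implementation: the "experts" below are hypothetical keyword detectors standing in for BAGEL's fine-tuned 86M-parameter classifiers, and the router is a trivial score comparison standing in for the paper's random forest over structural prompt features. All names (`EXPERTS`, `route`, `bagel_predict`, `add_expert`) are invented for this sketch.

```python
import random

# Hypothetical stand-in experts: each entry mocks a small fine-tuned
# classifier specialized on one attack family, returning 1 (malicious)
# or 0 (benign) for a prompt.
EXPERTS = {
    "jailbreak": lambda p: int(any(k in p.lower() for k in ("ignore previous", "dan mode"))),
    "injection": lambda p: int(any(k in p.lower() for k in ("system prompt", "override instructions"))),
    "harmful":   lambda p: int(any(k in p.lower() for k in ("build a weapon", "synthesize"))),
}

def route(prompt: str) -> str:
    """Toy router: pick the expert whose detector fires on this prompt.
    BAGEL instead trains a random forest on structural prompt features,
    which is also what gives it interpretability."""
    scores = {name: clf(prompt) for name, clf in EXPERTS.items()}
    return max(scores, key=scores.get)

def bagel_predict(prompt: str, k: int = 2, seed: int = 0) -> int:
    """Route to the top expert, stochastically sample k-1 additional
    members, and aggregate their votes (flag if at least half vote 1)."""
    rng = random.Random(seed)
    primary = route(prompt)
    others = rng.sample([m for m in EXPERTS if m != primary], k - 1)
    votes = [EXPERTS[m](prompt) for m in [primary, *others]]
    return int(sum(votes) / len(votes) >= 0.5)

def add_expert(name, classifier) -> None:
    """Incremental update: when a new attack family emerges, fine-tune a
    fresh small classifier on it and append it to the ensemble -- no
    retraining of existing members."""
    EXPERTS[name] = classifier
```

Note how the ensemble grows by appending a member rather than retraining everything, which is the property that keeps BAGEL's update cost at a single 86M-parameter fine-tune per new attack dataset.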

💡 Why This Paper Matters

This paper presents a significant advancement in LLM safety by introducing BAGEL, a lightweight and adaptable framework that effectively addresses the challenges of malicious prompt detection. By leveraging updatable ensembles of specialized classifiers, BAGEL not only enhances detection performance but also significantly lowers the computational burden associated with LLM guardrails. This is crucial as LLMs increasingly permeate mission-critical applications, necessitating robust and efficient safety mechanisms.

🎯 Why It's Interesting for AI Security Researchers

The paper is highly relevant to AI security researchers as it directly addresses the growing concern of adversarial attacks on large language models. With the rapid evolution of jailbreaking techniques and prompt injection strategies, understanding and deploying effective detection methods is more important than ever. BAGEL's innovative ensemble approach, combining computational efficiency with high accuracy and interpretability, provides a robust framework that can be adapted to counter emerging threats, making it a valuable resource for researchers aiming to improve LLM security.

📚 Read the Full Paper