← Back to Library

MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

Authors: Junhyeok Lee, Han Jang, Kyu Sung Choi

Published: 2026-02-06

arXiv ID: 2602.06268v1

Added to Library: 2026-02-09 03:01 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).

🔍 Key Points

  • Introduction of the Medical Prompt Injection Benchmark (MPIB) which evaluates clinical safety in Large Language Models (LLMs) under both direct and indirect prompt injection scenarios.
  • Development of the Clinical Harm Event Rate (CHER) metric alongside Attack Success Rate (ASR) to measure high-severity clinical harm events specifically in healthcare contexts.
  • MPIB dataset comprises 9,697 carefully curated clinical scenarios constructed with clinical safety in mind, allowing for systematic testing of current models against adversarial injections.
  • Empirical analysis demonstrates significant divergence between ASR and CHER, highlighting that higher compliance (ASR) does not necessarily imply lower clinical risk (CHER).
  • The study releases comprehensive evaluation resources including code and data, supporting reproducibility and facilitating future research on adversarial robustness in healthcare settings.

💡 Why This Paper Matters

The paper is significant as it establishes a critical framework for assessing the safety of AI models in clinical environments, where incorrect outputs can lead to serious patient harm. By focusing on the measurement of potential clinical harm through MPIB and the innovative use of CHER alongside traditional ASR metrics, the authors provide a new lens through which to evaluate the effectiveness of language models in high-stakes situations. The findings emphasize the need for robustness in AI systems deployed in healthcare, reinforcing the importance of ongoing vigilance against adversarial attacks.

🎯 Why It's Interesting for AI Security Researchers

This paper is crucial for AI security researchers because it addresses a pressing concern regarding the safety and reliability of AI-driven systems, particularly in healthcare. It introduces novel evaluation metrics and methods for testing vulnerabilities to prompt injection attacks, which can maliciously influence the output of healthcare AI systems. As AI continues to permeate clinical workflows, ensuring that these systems can resist adversarial manipulations is vital for patient safety, making this research particularly relevant for both the security and health informatics communities.

📚 Read the Full Paper