Detecting Jailbreak Attempts in Clinical Training LLMs Through Automated Linguistic Feature Extraction

Authors: Tri Nguyen, Huy Hoang Bao Le, Lohith Srikanth Pentapalli, Laurah Turner, Kelly Cohen

Published: 2026-02-10

arXiv ID: 2602.13321v1

Added to Library: 2026-02-17 03:01 UTC

📄 Abstract

Detecting jailbreak attempts in clinical training large language models (LLMs) requires accurate modeling of linguistic deviations that signal unsafe or off-task user behavior. Prior work on the 2-Sigma clinical simulation platform showed that manually annotated linguistic features could support jailbreak detection. However, reliance on manual annotation limited both scalability and expressiveness. In this study, we extend this framework by using experts' annotations of four core linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) and training multiple general-domain and medical-domain BERT-based models to predict these features directly from text. The most reliable feature regressor for each dimension was selected and used as the feature extractor for a second layer of classifiers. We evaluate a suite of predictive models, including tree-based, linear, probabilistic, and ensemble methods, to determine jailbreak likelihood from the extracted features. Across cross-validation and held-out evaluations, the system achieves strong overall performance, indicating that LLM-derived linguistic features provide an effective basis for automated jailbreak detection. Error analysis further highlights key limitations in current annotations and feature representations, pointing toward future improvements such as richer annotation schemes, finer-grained feature extraction, and methods that capture the evolving risk of jailbreak behavior over the course of a dialogue. This work demonstrates a scalable and interpretable approach for detecting jailbreak behavior in safety-critical clinical dialogue systems.
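The two-stage design described above can be sketched in miniature. This is an illustrative mock-up, not the authors' implementation: the stage-1 BERT regressor scores are stubbed with synthetic values, the stage-2 classifier is a hand-rolled logistic regression (the paper evaluates several classifier families), and the assumption that jailbreak turns score low on the first three features and high on Contextual Distraction is ours, not a reported annotation statistic.

```python
import math
import random

# The four expert-annotated linguistic dimensions from the paper.
FEATURES = ["professionalism", "medical_relevance",
            "ethical_behavior", "contextual_distraction"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Stage 2: fit a minimal logistic-regression classifier by
    batch gradient descent on log-loss. Returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def jailbreak_probability(scores, w, b):
    """Map the four stage-1 feature scores (each in [0, 1]) to a
    jailbreak probability. In the real pipeline, `scores` would come
    from the selected BERT-based regressor for each dimension."""
    x = [scores[f] for f in FEATURES]
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic stand-in for regressor outputs: jailbreak-labeled turns
# score low on the first three features and high on distraction
# (an assumption made only so this sketch is self-contained).
random.seed(0)
X, y = [], []
for _ in range(200):
    is_jb = random.random() < 0.5
    lo, hi = ((0.0, 0.4) if is_jb else (0.6, 1.0))
    X.append([random.uniform(lo, hi) for _ in range(3)]
             + [1.0 - random.uniform(lo, hi)])
    y.append(1.0 if is_jb else 0.0)

w, b = train_logistic(X, y)
p = jailbreak_probability(
    {"professionalism": 0.1, "medical_relevance": 0.2,
     "ethical_behavior": 0.1, "contextual_distraction": 0.9}, w, b)
print(f"jailbreak probability: {p:.2f}")
```

Swapping the synthetic scores for fine-tuned per-feature regressor outputs, and the logistic layer for a tree-based or ensemble classifier, recovers the shape of the pipeline the abstract describes.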

🔍 Key Points

  • Replaces manual annotation with BERT-based regressors: general-domain and medical-domain models are trained on expert annotations to score four linguistic features (Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction) directly from text.
  • A two-layer architecture in which the most reliable regressor per dimension feeds a second-layer classifier; tree-based, linear, probabilistic, and ensemble methods are compared for predicting jailbreak likelihood from the extracted features.
  • Strong performance across cross-validation and held-out evaluations on the 2-Sigma clinical simulation platform, with error analysis pointing toward richer annotation schemes, finer-grained features, and dialogue-level modeling of evolving jailbreak risk.

💡 Why This Paper Matters

This paper shows that automated, LLM-derived linguistic features can replace costly manual annotation for jailbreak detection in safety-critical clinical dialogue systems. By pairing per-dimension feature regressors with a second layer of interpretable classifiers, the authors deliver a detection pipeline that is both scalable and auditable, a combination that matters as clinical training simulators see wider deployment.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this work offers an interpretable alternative to opaque end-to-end jailbreak classifiers: risk is decomposed into named linguistic dimensions whose scores can be inspected and audited. Its error analysis also surfaces open problems, including finer-grained feature extraction and tracking how jailbreak risk evolves over the course of a dialogue, that apply well beyond the clinical domain.
