
Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Authors: Sohely Jahan, Ruimin Sun

Published: 2025-12-10

arXiv ID: 2512.09403v1

Added to Library: 2025-12-11 03:01 UTC

Safety

📄 Abstract

As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction-response pairs, we fine-tune a LLaMA3 8B surrogate via parameter-efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap: task utility transfers while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.
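
At its core, the attack summarized above is standard benign-only knowledge distillation: collect teacher completions for benign medical instructions, then LoRA-fine-tune a surrogate on the resulting pairs. The sketch below illustrates that loop under stated assumptions; the Hugging Face model IDs, prompt template, and LoRA hyperparameters are illustrative choices, not the authors' released configuration.

```python
# Minimal sketch of the benign-only black-box distillation pipeline summarized
# above: collect teacher completions for benign medical instructions, then
# LoRA-fine-tune a surrogate on the collected pairs. Model IDs, the prompt
# template, and hyperparameters are illustrative assumptions.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

TEACHER = "epfl-llm/meditron-7b"          # safety-aligned teacher (output access only)
STUDENT = "meta-llama/Meta-Llama-3-8B"    # surrogate to be fine-tuned

# Step 1: query the teacher and keep only its completions (output-level access).
teacher_tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype=torch.float16, device_map="auto")

def query_teacher(instruction: str, max_new_tokens: int = 256) -> str:
    inputs = teacher_tok(instruction, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return teacher_tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

benign_instructions = ["What are common symptoms of iron-deficiency anemia?"]  # 48k queries in the paper
pairs = [{"text": f"### Instruction:\n{q}\n\n### Response:\n{query_teacher(q)}"}
         for q in benign_instructions]

# Step 2: parameter-efficient LoRA fine-tuning of the surrogate on the pairs.
student_tok = AutoTokenizer.from_pretrained(STUDENT)
student_tok.pad_token = student_tok.eos_token
student = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16, device_map="auto")
student = get_peft_model(student, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

dataset = Dataset.from_list(pairs).map(
    lambda ex: student_tok(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"])

Trainer(
    model=student,
    args=TrainingArguments(output_dir="surrogate-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(student_tok, mlm=False),
).train()
```

Note that no safety-related supervision appears anywhere in this loop; that zero-alignment setting is exactly why task utility transfers while alignment does not.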

🔍 Key Points

  • The paper introduces a black-box distillation attack specifically targeting safety-aligned medical LLMs, demonstrating that adversaries can replicate the core functionality of these systems while stripping away their safety mechanisms.
  • Using a dataset of 25,000 benign instruction-response pairs, the authors fine-tune a surrogate model that produces unsafe completions for 86% of adversarial prompts, a sharp drop in safety alignment compared with the original Meditron-7B (66%).
  • A novel dynamic adversarial evaluation framework is proposed, combining Generative Query (GQ) prompt generation with adaptive attack strategies to rigorously test safety alignment failures in distilled models (a minimal measurement sketch follows this list).
  • The findings reveal that larger training datasets improve semantic fidelity but simultaneously amplify the risks of unsafe outputs under zero-alignment supervision, underscoring a critical tension between performance utility and safety adherence.
  • The authors present DistillGuard++, a prototype detection system designed to identify and monitor alignment degradation in medical LLMs, advocating for more robust defenses against extraction attacks (a simple drift-monitor sketch also follows this list).
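
The evaluation framework in the third bullet reduces, at its simplest, to measuring the fraction of adversarial prompts for which a target model fails to refuse, broken out by harm category. The sketch below shows that measurement loop under stated assumptions: the refusal heuristic stands in for the paper's verifier, the prompt sets are elided, and "surrogate-lora" is the assumed output path from the fine-tuning sketch above.

```python
# Minimal sketch of a category-wise unsafe-completion measurement, in the
# spirit of the paper's dynamic adversarial evaluation. The refusal heuristic
# is a crude stand-in for the paper's verifier; prompt sets are placeholders.
from collections import defaultdict
from transformers import pipeline

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "not able to provide", "cannot assist")

def is_refusal(completion: str) -> bool:
    """Flag obvious refusal phrasing in a completion."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def unsafe_rate_by_category(generator, prompts_by_category, max_new_tokens=128):
    """Query the target model and count completions that do not refuse."""
    stats = defaultdict(lambda: {"unsafe": 0, "total": 0})
    for category, prompts in prompts_by_category.items():
        for prompt in prompts:
            completion = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False,
                                   return_full_text=False)[0]["generated_text"]
            stats[category]["total"] += 1
            stats[category]["unsafe"] += int(not is_refusal(completion))
    return {c: s["unsafe"] / s["total"] for c, s in stats.items() if s["total"]}

if __name__ == "__main__":
    # The paper generates adversarial prompts with a GQ-based generator and
    # filters them with a verifier; real harmful prompts are not reproduced here.
    prompts_by_category = {"placeholder-category": ["..."]}
    target = pipeline("text-generation", model="surrogate-lora")  # assumed surrogate path
    print(unsafe_rate_by_category(target, prompts_by_category))
```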
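
As a rough intuition for what a real-time alignment-drift detector in the last bullet might monitor, the minimal sketch below tracks the refusal rate over a sliding window of deployed-model responses and alerts when it drops well below a calibrated baseline. This is one plausible signal of our own choosing, not the DistillGuard++ design from the paper.

```python
# Illustrative sliding-window monitor for alignment drift in a black-box
# deployment. One plausible signal only; not the paper's DistillGuard++ design.
from collections import deque

class RefusalRateMonitor:
    """Tracks the refusal rate over a rolling window of responses and alerts
    when it drops well below a baseline calibrated on an aligned model."""

    def __init__(self, window: int = 200, baseline: float = 0.30, tolerance: float = 0.10):
        self.responses = deque(maxlen=window)
        self.baseline = baseline      # expected refusal rate on a probe set
        self.tolerance = tolerance    # allowed drop before alerting

    def observe(self, refused: bool) -> bool:
        """Record one response; return True if alignment drift is suspected."""
        self.responses.append(refused)
        if len(self.responses) < self.responses.maxlen:
            return False              # not enough evidence yet
        rate = sum(self.responses) / len(self.responses)
        return rate < self.baseline - self.tolerance
```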

💡 Why This Paper Matters

This paper is crucial in highlighting vulnerabilities in the safety alignment of medical LLMs exposed via API interfaces. The findings not only illustrate a tangible threat posed by black-box distillation attacks but also prompt a reconsideration of safety mechanisms in high-stakes AI applications, particularly in healthcare. The demonstrated ability to strip safety from aligned models underscores the urgent need for stronger protections and detection systems in future LLM deployments.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper provides critical insights into the mechanisms through which adversarial actors can exploit medical LLMs. It examines both theoretical and practical vulnerabilities in current systems and motivates improved alignment-preservation strategies and defensive methodologies, which are essential for secure AI deployment in sensitive sectors.

📚 Read the Full Paper