Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Authors: Jiawei Mao, Yuhan Wang, Lifeng Chen, Can Zhao, Yucheng Tang, Dong Yang, Liangqiong Qu, Daguang Xu, Yuyin Zhou

Published: 2025-10-07

arXiv ID: 2510.06131v1

Added to Library: 2025-11-14 23:14 UTC

📄 Abstract

Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
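
The two design changes named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify the corruption process or the embedding form, so the absorbing `MASK_ID` token, the per-token masking probability `t`, and the sinusoidal form of `timestep_embedding` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math
import random

MASK_ID = -1  # hypothetical id for an absorbing [MASK] token (assumption)


def corrupt(tokens, t, rng):
    """Forward-process sketch: replace each token with MASK_ID with probability t.

    t is a continuous noise level in [0, 1]; t=0 keeps the sequence intact,
    t=1 masks every token. The masking kernel is an assumption, not taken
    from the paper.
    """
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def attention_mask(n, causal):
    """Return an n x n boolean matrix; mask[i][j] is True if i may attend to j.

    causal=True is the usual autoregressive LLM mask (j <= i).
    causal=False is the bidirectional case, where every position sees
    every other position, as needed when denoising all tokens jointly.
    """
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a continuous timestep t (dim must be even).

    One common way to make a backbone aware of the noise level; the
    sinusoidal form is an assumption here.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * k / half) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy shared sequence: text-token ids followed by image-token ids.
    tokens = [11, 12, 13, 101, 102, 103]
    print("noisy sequence:", corrupt(tokens, 0.5, rng))
    print("causal row 0:", attention_mask(3, causal=True)[0])
    print("bidirectional row 0:", attention_mask(3, causal=False)[0])
    print("embedding length:", len(timestep_embedding(0.5, 8)))
```

In a sketch like this, text and image tokens sit in one sequence, so the same corruption and denoising steps apply to both modalities. The bidirectional mask matters because a masked position may depend on tokens to its right as well as to its left.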

🔍 Key Points

MeDiM is presented as the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components.
A single model covers several generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts.
A multimodal large language model (MLLM) serves as the diffusion backbone, with two modifications: the causal attention mask is removed to provide bidirectional context, and continuous timestep embeddings are injected for diffusion awareness.
Reported results include FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen for image generation, and METEOR 0.2650 and 0.2580 for report generation.
Jointly generated image-report pairs improve downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR).

💡 Why This Paper Matters

Generative medical models have largely been confined to modality-specific settings, which hinders the integration of complementary evidence from imaging, pathology, and clinical notes. By modeling images and text in one shared discrete probabilistic space, MeDiM moves toward medical foundation models that can learn and generate across modalities. The authors also report that its jointly generated image-report pairs improve downstream performance.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper offers crucial insights into the evolving landscape of vulnerabilities associated with LLMs, particularly in the context of their integration with external data sources. The empirical data on attack success rates, coupled with a comprehensive analysis of obfuscation techniques, serves as a foundational study for understanding and mitigating security threats in generative AI systems. Furthermore, the proposed frameworks and defensive strategies can guide researchers in developing robust countermeasures against increasingly sophisticated adversarial tactics.

📚 Read the Full Paper