
MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Authors: Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang

Published: 2025-06-02

arXiv ID: 2506.01252v1

Added to Library: 2025-06-04 04:01 UTC


📄 Abstract

Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.
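
The abstract outlines five task categories built from case records, licensing exams, and classical texts. As a rough illustration of how such a five-category benchmark could be organized for local evaluation, the sketch below models the categories as a simple Python mapping; the sub-dataset file names, directory layout, and `load_benchmark` helper are assumptions for illustration only and do not reflect the actual MTCMB repository structure.

```python
import json
from pathlib import Path

# Five task categories named in the abstract. The sub-dataset file names are
# placeholders; the real 12 sub-datasets and their layout are defined in the
# MTCMB repository (https://github.com/Wayyuanyuan/MTCMB).
MTCMB_CATEGORIES = {
    "knowledge_qa": ["exam_qa.json", "classical_text_qa.json"],
    "language_understanding": ["terminology.json", "entity_extraction.json"],
    "diagnostic_reasoning": ["syndrome_differentiation.json", "case_reasoning.json"],
    "prescription_generation": ["formula_generation.json", "herb_selection.json"],
    "safety_evaluation": ["contraindications.json", "safety_dialogue.json"],
}

def load_benchmark(root: str) -> dict[str, list[dict]]:
    """Collect every available sub-dataset under its category from a local directory."""
    data: dict[str, list[dict]] = {}
    for category, files in MTCMB_CATEGORIES.items():
        items: list[dict] = []
        for name in files:
            path = Path(root) / category / name
            if path.exists():  # skip sub-datasets that are not present locally
                items.extend(json.loads(path.read_text(encoding="utf-8")))
        data[category] = items
    return data
```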

🔍 Key Points

  • Introduction of MTCMB, the first multi-task benchmark for evaluating LLMs in the domain of Traditional Chinese Medicine (TCM), encompassing knowledge, reasoning, and safety tasks.
  • MTCMB consists of 12 sub-datasets across five major categories, including knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation, developed in collaboration with TCM experts.
  • Evaluation of 14 state-of-the-art LLMs shows that while these models perform well on factual recall, they fall short in TCM-specific reasoning and safety compliance, revealing significant gaps in clinical applicability.
  • The benchmark's analysis indicates that existing models are not aligned with the holistic and context-sensitive nature of TCM, suggesting the need for new training paradigms, hybrid architectures, and safety-focused learning approaches to improve TCM AI systems.
  • The public availability of the MTCMB datasets, evaluation metrics, and model scoring tools supports reproducible research and encourages further development of TCM-capable AI systems; a minimal scoring sketch follows this list.
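
Since the key points note that evaluation metrics and scoring tools ship with the benchmark, the reported category-level comparison can be pictured as a per-category tally over (prediction, reference) pairs. The helper below is a hypothetical exact-match scorer suitable only for multiple-choice knowledge-QA items; it is not the repository's actual evaluation code, and open-ended tasks such as prescription generation or safety evaluation would need rubric- or expert-based scoring instead.

```python
from collections import defaultdict

def per_category_accuracy(results: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per category.

    Each result is assumed to look like:
        {"category": "knowledge_qa", "prediction": "B", "answer": "B"}
    This is an illustrative metric, not MTCMB's official scoring tool.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in results:
        cat = r["category"]
        total[cat] += 1
        if r["prediction"].strip() == r["answer"].strip():
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    toy = [
        {"category": "knowledge_qa", "prediction": "A", "answer": "A"},
        {"category": "knowledge_qa", "prediction": "C", "answer": "B"},
        {"category": "safety_evaluation", "prediction": "yes", "answer": "yes"},
    ]
    print(per_category_accuracy(toy))  # {'knowledge_qa': 0.5, 'safety_evaluation': 1.0}
```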

💡 Why This Paper Matters

This paper presents the MTCMB benchmark, which is crucial for systematically evaluating and advancing AI models in the specialized context of Traditional Chinese Medicine. Its findings emphasize the existing gaps in model capabilities regarding clinical reasoning and safety, which are critical for the safe and effective deployment of AI in healthcare applications. The comprehensive approach taken by this benchmark lays the groundwork for developing more competent and reliable medical AI systems that can be safely integrated into TCM practices.

🎯 Why It's Interesting for AI Security Researchers

The MTCMB benchmark is of particular interest to AI security researchers because it exposes safety and reliability weaknesses of existing LLMs in critical healthcare scenarios. Understanding where models fail when generating TCM prescriptions can inform robust safety mechanisms and protocols that mitigate the risks of hallucinations, inaccuracies, and potentially harmful recommendations, which is essential for user trust and safety in deployed AI applications.

📚 Read the Full Paper

arXiv: https://arxiv.org/abs/2506.01252v1