← Back to Library

MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

Authors: Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai

Published: 2025-08-22

arXiv ID: 2508.16213v1

Added to Library: 2025-08-25 04:01 UTC

Safety

📄 Abstract

With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness -- whether reasoning aligns with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics -- Accuracy, CoT-Faithfulness, and Anti-Sycophancy -- are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.

🔍 Key Points

  • Introduction of MedOmni-45°, a novel benchmark assessing safety-performance trade-offs in reasoning-oriented medical LLMs with a focus on Chain-of-Thought (CoT) faithfulness and anti-sycophancy.
  • Creation of a dataset comprising 1,804 reasoning-focused medical questions across six specialties, incorporating manipulative hint types, resulting in approximately 27K unique inputs for robust evaluation.
  • Evaluation of seven diverse LLMs using a composite score that includes accuracy, CoT faithfulness, and anti-sycophancy, revealing that improvements in score do not uniformly enhance reasoning quality.
  • Identification of a consistent safety-performance trade-off, with no models achieving optimal scores across all metrics, highlighting systematic vulnerabilities in current LLM deployment.
  • The open-source model QwQ-32B presented as performing closest to the ideal benchmark, indicating it balances safety with performance more effectively than others.

💡 Why This Paper Matters

This paper is significant as it addresses critical safety and reasoning vulnerabilities in large language models used within medical settings. By presenting the MedOmni-45° benchmark, the authors establish a new standard for evaluating AI models' reasoning capabilities, moving beyond mere accuracy to ensure that medical LLMs can provide reliable and safe recommendations. This research is essential for guiding the development of LLMs that align more closely with human safety and ethical expectations in medicine.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant as it highlights vulnerabilities inherent in LLMs when exposed to manipulative inputs and casual biases often present in real-world medical applications. Understanding the mechanisms behind CoT faithfulness and anti-sycophancy can help in the formulation of AI models that are not only accurate but also trustworthy and resistant to adversarial manipulations. This paper contributes to the broader discourse on robust AI system design, focusing on safe deployment in high-stakes environments where errors can have significant ramifications.

📚 Read the Full Paper