
MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

Authors: Jiayi Fu, Qiyao Sun

Published: 2025-11-08

arXiv ID: 2511.05867v2

Added to Library: 2025-11-14 23:04 UTC

πŸ“„ Abstract

Large language models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM-MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false negatives, while models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false positives. Reinforcement Learning from Verifiable Reward (RLVR) offers a better precision-recall balance: after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at: https://github.com/PorUna-byte/MCP-RiskCue
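The abstract frames the detection problem in terms of false negatives (missed risky logs), false positives (over-flagged benign logs), and overall accuracy. As an illustrative sketch, not code from the paper, the following shows how these metrics relate for a binary risk-log classifier, where label 1 marks a risky log and 0 a benign one:

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for a binary risk-log
    classifier (1 = risky, 0 = benign). High false negatives
    depress recall; high false positives depress precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```

In these terms, the paper's observation is that small untrained models sit at low recall, SFT models at low precision, and RLVR-trained models strike a better balance between the two.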

πŸ” Key Points

  • First synthetic benchmark for evaluating whether LLMs can infer security risks from the system logs of malicious or compromised MCP servers, a gap left by prior benchmarks focused on prompt injection and interaction trajectories.
  • Definition of nine categories of MCP server risks and generation of 1,800 synthetic system logs using ten state-of-the-art LLMs.
  • Construction of a dataset by embedding these logs in the return values of 243 curated MCP servers, yielding 2,421 chat histories for training and 471 queries for evaluation.
  • Pilot experiments showing that smaller models frequently miss risky logs (high false negatives), while SFT-trained models over-flag benign logs (high false positives); RLVR achieves a better precision-recall balance.
  • After GRPO training, Llama3.1-8B-Instruct reaches 83% accuracy, surpassing the best-performing large remote model by 9 percentage points, with per-category analysis confirming the benefits of reinforcement learning for MCP safety.
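The GRPO training mentioned in the abstract is built on group-relative advantages: each sampled response's verifiable reward is normalized against the mean and standard deviation of its sampling group, removing the need for a learned value critic. A minimal sketch of that normalization step (an illustration of the general GRPO recipe, not the paper's implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each
    sampled response's reward against its group's mean and
    standard deviation. Responses better than the group average
    get positive advantages, worse ones get negative."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]
```

With a verifiable reward such as "did the model correctly flag the risky log?", the rewards are often binary per group, which is exactly the setting RLVR exploits.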

πŸ’‘ Why This Paper Matters

This paper matters because MCP has quickly become the standard interface for connecting LLMs to external tools, yet a compromised or untrustworthy MCP server can inject malicious content directly into a model's context. By shifting attention from prompt injection benchmarks to the underlying system logs of malicious servers, the authors fill a gap in MCP security evaluation and show that a small open model, trained with RLVR, can outperform much larger remote models at flagging risky logs, pointing toward practical, deployable safeguards for tool-augmented LLM systems.

🎯 Why It's Interesting for AI Security Researchers

This research is of direct interest to AI security researchers because it provides the first benchmark, with released code and data, for log-based risk detection in the MCP ecosystem, covering nine risk categories across 243 curated servers. Beyond the dataset itself, the comparison of training regimes is instructive: SFT inflates false positives while RLVR (via GRPO) yields a better precision-recall balance, offering concrete guidance for how to train safety classifiers and monitors for tool-augmented LLM deployments.

πŸ“š Read the Full Paper