
The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Authors: Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin

Published: 2026-01-15

arXiv ID: 2601.10307v1

Added to Library: 2026-01-16 03:02 UTC

📄 Abstract

Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet distinct moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
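
The abstract does not detail how the Moral Vectors are extracted or applied, so the sketch below only illustrates the generic recipe this line of work typically follows: take the difference of mean middle-layer activations between contrastive prompt sets, then add the resulting direction back into the residual stream at inference time via a forward hook. The model (GPT-2), layer index, toy prompts, and injection strength ALPHA are placeholder assumptions, not the paper's actual setup.

```python
# Hedged sketch: difference-of-means "Moral Vector" extraction plus inference-time
# injection. Model, layer, prompts, and strength are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder model; the paper's models are not named in this summary
LAYER = 6        # hypothetical "middle layer" index
ALPHA = 4.0      # hypothetical injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_last_token_state(texts, layer):
    """Average the hidden state of the last token at `layer` over a set of texts."""
    states = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# Toy contrastive prompts; the paper derives its data from Moral Foundations Theory.
moral_prompts   = ["Helping a stranger in need is the right thing to do."]
immoral_prompts = ["Deceiving others for personal gain is acceptable."]

# "Moral Vector": normalized difference of mean activations between the two sets.
moral_vec = (mean_last_token_state(moral_prompts, LAYER)
             - mean_last_token_state(immoral_prompts, LAYER))
moral_vec = moral_vec / moral_vec.norm()

def steering_hook(module, inputs, output):
    """Add the moral direction to this block's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * moral_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# Inject the vector at the chosen middle layer during generation.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = tok("Is it okay to lie to a friend for money?", return_tensors="pt")
    generated = model.generate(**prompt, max_new_tokens=40, do_sample=False)
    print(tok.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()
```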

🔍 Key Points

  • Leverages Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of Large Language Models (LLMs).
  • Uses cross-lingual linear probing to show that moral representations are shared in the middle layers, revealing a shared yet distinct moral subspace between English and Chinese (see the probe sketch after this list).
  • Extracts steerable Moral Vectors and validates their efficacy at both the internal (representation) and behavioral levels.
  • Proposes Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that combines probe detection with vector injection to tackle the safety-helpfulness trade-off.
  • Reports empirical results showing the approach acts as a targeted intrinsic defense, reducing incorrect refusals on benign queries while keeping jailbreak success rates low relative to standard baselines.
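
None of the probing details (models, layers, or data) are given in this summary, so the following is only a rough illustration of the cross-lingual probing idea in the second bullet: fit a linear probe on middle-layer activations of English sentences labelled moral/immoral, then test it on Chinese sentences; above-chance transfer would point toward a shared moral subspace. The multilingual encoder, layer index, and toy sentences are placeholders standing in for the LLMs and MFT-derived data used in the paper.

```python
# Hedged sketch of a cross-lingual linear probe on middle-layer activations.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "bert-base-multilingual-cased"   # small multilingual stand-in, not the paper's model
LAYER = 6                                 # hypothetical middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def features(texts, layer=LAYER):
    """Mean-pooled hidden states at `layer` for each text."""
    feats = []
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        feats.append(out.hidden_states[layer][0].mean(dim=0).numpy())
    return feats

# Tiny toy dataset; the paper builds labelled data from Moral Foundations Theory.
en_texts = ["Sharing food with the hungry is kind.",
            "Cheating on an exam is acceptable.",
            "Protecting the vulnerable is a duty.",
            "Spreading lies about a colleague is fine."]
en_labels = [1, 0, 1, 0]                  # 1 = moral, 0 = immoral

zh_texts = ["帮助有需要的陌生人是正确的。",        # "Helping a stranger in need is right."
            "为了私利欺骗他人是可以接受的。"]       # "Deceiving others for personal gain is acceptable."
zh_labels = [1, 0]

# Train a linear probe on English activations, then evaluate it on Chinese ones;
# above-chance transfer would indicate a shared moral subspace across languages.
probe = LogisticRegression(max_iter=1000).fit(features(en_texts), en_labels)
print("EN -> ZH probe accuracy:", probe.score(features(zh_texts), zh_labels))
```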

💡 Why This Paper Matters

Most alignment techniques operate as external guardrails that leave a model's internal moral representations untouched. By mapping those representations with Moral Foundations Theory, probing them across languages, and steering them with extracted Moral Vectors, this paper shows that moral behavior can be analyzed and adjusted at the representation level, and its Adaptive Moral Fusion method offers an inference-time mechanism that eases the long-standing safety-helpfulness trade-off.

🎯 Why It's Interesting for AI Security Researchers

For security researchers, the work reframes jailbreak defense as an intrinsic, representation-level problem rather than a purely behavioral one: a lightweight probe detects morally risky inputs and a targeted vector injection steers the model away from harmful completions, reducing jailbreak success while avoiding blanket over-refusal of benign queries. The cross-lingual findings further suggest that such defenses can generalize across languages rather than being tied to English-only safety training.
