The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs

Authors: Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana

Published: 2025-10-09

arXiv ID: 2510.07775v1

Added to Library: 2025-10-10 04:01 UTC

Safety

📄 Abstract

Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
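As a concrete illustration of the disentanglement step described in the abstract, the sketch below trains a small sparse autoencoder over residual-stream activations and selects latents that respond to refusal-eliciting prompts but stay quiet on hallucination probes. This is a minimal sketch in PyTorch, not the authors' exact recipe: the SAE architecture, the L1 coefficient, the layer the activations come from, and the selection margin are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations: a wide ReLU bottleneck
    trained with reconstruction + L1 sparsity (hyperparameters are placeholders)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))        # sparse latent codes
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latents."""
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()

@torch.no_grad()
def refusal_specific_latents(sae, refusal_acts, hallucination_acts, margin: float = 0.1):
    """Return indices of latents that fire on refusal-eliciting prompts but not on
    hallucination probes -- candidate refusal features to protect."""
    _, z_ref = sae(refusal_acts)
    _, z_hal = sae(hallucination_acts)
    score = z_ref.mean(dim=0) - z_hal.mean(dim=0)   # differential activation per latent
    return (score > margin).nonzero(as_tuple=True)[0]
```

The decoder columns for the selected latents, `sae.dec.weight[:, idx].T`, give a (k, d_model) set of directions in the model's residual space; these can serve as the refusal subspace that the orthogonalization step (sketched after the key points below) protects during fine-tuning.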

🔍 Key Points

  • The paper identifies a critical trade-off in AI alignment between truthfulness and safety in LLMs, showing that improving factual accuracy can inadvertently weaken refusal behavior.
  • It highlights that hallucination and refusal signals share overlapping internal representations within LLM architectures, leading to unintended consequences during fine-tuning.
  • The authors propose a method that uses sparse autoencoders to disentangle hallucination-related from refusal-related features, preserving safety alignment while improving truthfulness (see the orthogonalization sketch after this list).
  • Experimental results show that the approach mitigates the trade-off between accuracy and safety without degrading task utility, evaluated on commonsense reasoning tasks and the AdvBench and StrongReject harmful-prompt benchmarks.
  • The findings underscore the need for alignment methods that avoid suppressing factual knowledge while preserving refusal behavior, improving the overall robustness of LLMs.
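To make the subspace-orthogonalization idea concrete, the sketch below builds a projector onto the orthogonal complement of a set of refusal directions (for example, the SAE decoder directions selected above) and applies it to fine-tuning gradients of selected weight matrices, so updates cannot move the model along refusal-encoding directions. The choice of target layers, the module-name filter, and the decision to project gradients rather than activations are assumptions for illustration, not the paper's exact placement.

```python
import torch

def orthogonal_complement_projector(refusal_dirs: torch.Tensor) -> torch.Tensor:
    """Given refusal directions as rows (k, d_model), return P = I - Q Q^T,
    which removes any component lying in their span."""
    q, _ = torch.linalg.qr(refusal_dirs.T)          # orthonormal basis Q: (d_model, k)
    d_model = refusal_dirs.shape[1]
    return torch.eye(d_model, device=refusal_dirs.device) - q @ q.T

def project_out_refusal_gradients(model, projector: torch.Tensor,
                                  name_filter=("mlp", "down_proj")):
    """Call after loss.backward(): strip the refusal-subspace component from the
    gradients of matching 2-D weight matrices before the optimizer step.
    The name filter is a hypothetical placeholder for the layers of interest."""
    for name, param in model.named_parameters():
        if param.grad is not None and any(key in name for key in name_filter):
            if param.grad.dim() == 2 and param.grad.shape[1] == projector.shape[0]:
                # Project each row's input-space component off the refusal subspace.
                param.grad.copy_(param.grad @ projector.to(param.grad.dtype))

# Typical use inside a fine-tuning loop (hypothetical names):
#   loss = model(**batch).loss
#   loss.backward()
#   project_out_refusal_gradients(model, projector)
#   optimizer.step(); optimizer.zero_grad()
```

The same projector could instead be applied to activations through forward hooks; either way, the intent is that benign fine-tuning reduces hallucination without eroding the refusal features identified by the SAE.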

💡 Why This Paper Matters

This paper is significant because it uncovers and addresses a nuanced issue in AI alignment with direct implications for the safety and reliability of language models. By characterizing the unintended trade-off between truthfulness and safety alignment, the work deepens our understanding of LLM behavior and offers a practical method for reducing hallucination without compromising refusal behavior or task performance.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly interesting because it examines how hallucination mitigation interacts with refusal behavior in LLMs. The insight that alignment and benign fine-tuning can inadvertently erode safety mechanisms is critical for building more secure AI systems. Moreover, the proposed technique for preserving refusal features during fine-tuning offers a concrete defense against the harmful-prompt attacks measured by benchmarks such as AdvBench and StrongReject, a core concern in AI security.

📚 Read the Full Paper