Toward Honest Language Models for Deductive Reasoning

Authors: Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

Published: 2025-11-12

arXiv ID: 2511.09222v4

Added to Library: 2025-12-01 04:00 UTC

📄 Abstract

Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ANCHOR, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
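
The collapse described in the abstract can be illustrated with a small sketch. It assumes GRPO-style group-normalized advantages, where each rollout's reward is standardized against the other rollouts sampled for the same prompt. The reward values (-1 for a wrong answer, +1 for the ground truth), the choice to replace one sampled rollout rather than append to the group, and the helper name `inject_ground_truth` are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def inject_ground_truth(rollouts, rewards, gt_trajectory, gt_reward=1.0):
    """Hypothetical helper: swap the last sampled rollout for a ground-truth one.

    Whether the actual method replaces a rollout or appends to the group is
    not stated in the abstract; replacement is an assumption made here.
    """
    return rollouts[:-1] + [gt_trajectory], rewards[:-1] + [gt_reward]


# Early in training every sampled rollout may be wrong, so all rewards match.
rollouts = ["wrong_1", "wrong_2", "wrong_3", "wrong_4"]
rewards = [-1.0, -1.0, -1.0, -1.0]

# With identical rewards the normalized advantages are all zero,
# so the policy gradient carries no learning signal for this prompt.
collapsed = group_advantages(rewards)

# Injecting one ground-truth trajectory restores variance within the group.
anchored_rollouts, anchored_rewards = inject_ground_truth(
    rollouts, rewards, "ground_truth_derivation"
)
anchored = group_advantages(anchored_rewards)

print("without injection:", collapsed)
print("with injection:   ", anchored)
```

In this toy group, the injected trajectory receives a positive advantage and the wrong rollouts receive negative ones, which is one plausible reading of how injecting ground-truth trajectories could keep training from stalling when negative rewards dominate.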

🔍 Key Points

Definition of honesty in deductive reasoning as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise.
Formulation of honest deductive reasoning as multi-step tasks in which a model must either derive the correct conclusion or abstain.
Construction of two datasets from graph structures, one for linear algebra and one for logical inference, with unanswerable cases created by randomly perturbing an edge in half of the instances.
Finding that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks, in part because optimizing only for final outcomes leaves models vulnerable to collapse when negative rewards dominate early training.
Proposal of a reinforcement learning method that injects ground-truth trajectories into rollouts, which prevents early training collapse, stabilizes learning, and significantly improves overall reasoning performance.

💡 Why This Paper Matters

This paper addresses a failure mode in which language models produce unwarranted answers when the given premises are insufficient. It frames honesty as answering only when a conclusion is logically entailed and abstaining otherwise, and it shows that outcome-only reinforcement learning can collapse early in training on such tasks. Together, these results point to training dynamics as a key factor in obtaining models that reason honestly.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, the relevant issue is reliability: a model that answers when the input does not support a conclusion is difficult to trust in downstream applications. The paper's abstention-based formulation and its datasets with deliberately unanswerable instances offer a controlled way to measure this behavior, and its results indicate that prompting and existing training methods such as GRPO do not resolve it on their own.
