Toward Honest Language Models for Deductive Reasoning

Authors: Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

Published: 2025-11-12

arXiv ID: 2511.09222v3

Added to Library: 2025-11-26 03:00 UTC

📄 Abstract

Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose ANCHOR, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
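
The collapse described in the abstract can be illustrated with a small sketch. It assumes the group-normalized advantage commonly used with GRPO (reward minus group mean, divided by group standard deviation) and hypothetical rewards of -1 for a wrong answer and +1 for a correct one. The function names, the reward values, and the choice to swap one sampled rollout for the ground truth trajectory are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_ground_truth(rewards, gt_reward=1.0):
    """Swap the last sampled rollout for a ground truth trajectory.

    Hypothetical helper: the paper's actual injection scheme may differ.
    """
    return rewards[:-1] + [gt_reward]


# Early in training every sampled rollout may fail (assumed reward of -1 each).
failed_group = [-1.0, -1.0, -1.0, -1.0]

# Without injection all rewards are identical, so every advantage is zero
# and the group contributes no learning signal.
plain = group_advantages(failed_group)

# With one ground truth trajectory in the group, the advantages separate:
# the injected trajectory is reinforced and the failed rollouts are penalized.
anchored = group_advantages(inject_ground_truth(failed_group))

print(plain)
print(anchored)
```

Under these assumptions, a group of uniformly failed rollouts yields all-zero advantages, while the group containing the injected trajectory yields a positive advantage for that trajectory and negative advantages for the rest.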

🔍 Key Points

Definition of honesty in deductive reasoning as responding only when the conclusion is logically entailed by the given premises, and abstaining otherwise.
Formulation of honest deductive reasoning as multi-step tasks in which a model must either derive the correct conclusion or abstain.
Curation of two datasets from graph structures, one for linear algebra and one for logical inference, with unanswerable cases introduced by randomly perturbing an edge in half of the instances.
Finding that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks, because GRPO optimizes only for final task outcomes and is vulnerable to collapse when negative rewards dominate early training.
Proposal of a reinforcement learning method that injects ground truth trajectories into rollouts, which prevents early training collapse, stabilizes learning, and significantly improves overall reasoning performance.

💡 Why This Paper Matters

This paper addresses a specific failure of current language models: producing unwarranted answers when the given premises are insufficient to support a conclusion. By framing honesty as answering only when the conclusion is logically entailed and abstaining otherwise, and by building graph-derived datasets in which half of the instances are made unanswerable, the authors give this failure a concrete, testable form. Their results show that prompting and existing training methods struggle on these tasks, and that injecting ground truth trajectories into rollouts stabilizes learning, underscoring the importance of training dynamics for honest deductive reasoning.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers may find this paper relevant because it studies when a model should decline to answer: the tasks require abstaining whenever the premises do not entail the conclusion. The paper reports that GRPO, which optimizes only for final task outcomes, can collapse when negative rewards dominate early training, and that injecting ground truth trajectories into rollouts prevents this collapse. These findings bear on how training dynamics shape a model's ability to reason honestly rather than produce unwarranted answers.
