
Toward Honest Language Models for Deductive Reasoning

Authors: Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan

Published: 2025-11-12

arXiv ID: 2511.09222v2

Added to Library: 2025-11-25 03:01 UTC

📄 Abstract

Deductive reasoning is the process of deriving conclusions strictly from the given premises, without relying on external knowledge. We define honesty in this setting as a model's ability to respond only when the conclusion is logically entailed by the premises, and to abstain otherwise. However, current language models often fail to reason honestly, producing unwarranted answers when the input is insufficient. To study this challenge, we formulate honest deductive reasoning as multi-step tasks where models must either derive the correct conclusion or abstain. We curate two datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks. In particular, GRPO optimizes only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. To address this, we propose \methodname{}, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves overall reasoning performance, underscoring the importance of training dynamics for enabling honest deductive reasoning in language models.
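
The abstract's central mechanism is mixing ground-truth trajectories into the sampled rollouts so that the group-relative advantage signal does not degenerate into all-negative rewards early in training. The following is a minimal sketch of that idea under a simplified GRPO-style advantage computation; the names (Rollout, inject_ground_truth, grpo_advantages) and the reward scheme are illustrative assumptions, not taken from the paper or its code.

from dataclasses import dataclass
from statistics import mean, pstdev
import random

@dataclass
class Rollout:
    text: str               # model-generated (or injected) reasoning trajectory
    reward: float           # task reward, e.g. 1.0 for a correct answer or abstention, else 0.0
    injected: bool = False  # True if this is a ground-truth demonstration, not a sample

def inject_ground_truth(rollouts, gt_text, gt_reward=1.0):
    # Replace one sampled rollout with the ground-truth trajectory for this prompt,
    # guaranteeing at least one positive-reward trajectory in every group so the
    # group-relative signal does not collapse to all-negative early in training.
    out = list(rollouts)
    out[random.randrange(len(out))] = Rollout(gt_text, gt_reward, injected=True)
    return out

def grpo_advantages(rollouts):
    # Group-relative advantage: reward minus group mean, scaled by group std.
    rewards = [r.reward for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

if __name__ == "__main__":
    # Early in training every sampled rollout may fail (all rewards 0.0), giving
    # zero advantages and no learning signal; injection restores a gradient.
    sampled = [Rollout(f"attempt {i}", 0.0) for i in range(4)]
    mixed = inject_ground_truth(sampled, "step-by-step derivation ... answer: abstain")
    print(grpo_advantages(mixed))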

🔍 Key Points

  • Defines honesty in deductive reasoning as responding only when the conclusion is logically entailed by the premises and abstaining otherwise, and shows that current language models often produce unwarranted answers when the given premises are insufficient.
  • Curates two multi-step datasets from graph structures, one for linear algebra and one for logical inference, and introduces unanswerable cases by randomly perturbing an edge in half of the instances (a construction sketched after this list).
  • Finds that prompting and existing training methods, including GRPO with or without supervised fine-tuning initialization, struggle on these tasks; because GRPO optimizes only for final task outcomes, models are vulnerable to collapse when negative rewards dominate early training.
  • Proposes a reinforcement learning method (\methodname{} in the abstract) that injects ground-truth trajectories into rollouts, preventing early training collapse.
  • Shows that the method stabilizes learning and significantly improves overall reasoning performance, highlighting the role of training dynamics in enabling honest deductive reasoning.
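
To make the unanswerable-case construction concrete, here is a small sketch of how such instances could be generated: a chain of implication premises over a graph, with one edge redirected to an off-chain node in the unanswerable half so that the queried conclusion is no longer entailed. The premise templates, entity names, and function signature are assumptions for illustration, not the paper's actual data-generation pipeline.

import random

def make_instance(entities, answerable, seed=0):
    # Build a chain of implication premises A -> B -> ... -> Z and query whether
    # the first entity implies the last. If `answerable` is False, one edge is
    # redirected to an off-chain node, so the conclusion is no longer entailed
    # and the correct response is to abstain.
    rng = random.Random(seed)
    edges = [(entities[i], entities[i + 1]) for i in range(len(entities) - 1)]
    if not answerable:
        i = rng.randrange(len(edges))
        src, _ = edges[i]
        edges[i] = (src, f"X{rng.randrange(100)}")  # perturbed edge breaks the chain
    premises = [f"If {a} then {b}." for a, b in edges]
    rng.shuffle(premises)  # hide the chain order from the model
    query = f"Does {entities[0]} imply {entities[-1]}?"
    label = "yes" if answerable else "abstain"
    return {"premises": premises, "query": query, "label": label}

if __name__ == "__main__":
    # Half of the instances are made unanswerable, matching the paper's setup.
    for k in range(4):
        print(make_instance(["A", "B", "C", "D"], answerable=(k % 2 == 0), seed=k))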

💡 Why This Paper Matters

This paper targets a basic reliability failure: language models that assert conclusions their premises do not support. By formalizing honesty as derive-or-abstain, building controlled benchmarks with deliberately unanswerable instances, and diagnosing why outcome-only GRPO training collapses when negative rewards dominate early on, the work both exposes a gap in current training methods and offers a reinforcement learning remedy, making it a useful reference for anyone training models for multi-step deductive reasoning.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, the failure mode studied here, a model that confidently answers when its inputs do not warrant any answer, is closely related to hallucination and over-compliance. The paper's answer-or-abstain formulation, its benchmarks with deliberately unanswerable cases, and its analysis of reward collapse during GRPO training provide concrete tools for building and evaluating models that fail safely by abstaining rather than fabricating a conclusion.

📚 Read the Full Paper