← Back to Library

Reasoning Up the Instruction Ladder for Controllable Language Models

Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Published: 2025-10-30

arXiv ID: 2511.04694v2

Added to Library: 2025-11-14 23:06 UTC

Red Teaming

📄 Abstract

As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises both aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks. These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

🔍 Key Points

  • The paper introduces the concept of instruction hierarchy (IH) in large language models (LLMs), emphasizing the importance of resolving conflicting instructions based on priority.
  • A novel dataset called VerIH is created for training LLMs in reasoning about instruction hierarchies, incorporating both aligned and conflicting user-system prompt pairs.
  • The authors implement a lightweight reinforcement learning approach to enhance the model's reasoning capabilities, leading to improved instruction following and hierarchy compliance in diverse scenarios.
  • Experimental results demonstrate that the proposed framework not only improves performance on instruction-related tasks but also enhances model robustness against adversarial attacks such as jailbreak and prompt injection.
  • The research indicates that prioritizing high-level instructions can dynamically adjust model behavior, potentially increasing the flexibility and reliability of LLM applications in safety-critical contexts.

💡 Why This Paper Matters

This paper is relevant as it contributes significantly to the understanding of how LLMs can effectively navigate and prioritize conflicting instructions, a crucial requirement for AI systems that operate in safety-sensitive environments. By formulating instruction hierarchy resolution as a reasoning task and providing a structured dataset for training, the authors advance the field towards creating more reliable and controllable language models. Furthermore, the practical implications of this work are substantial, as they pave the way for the deployment of LLMs in real-world applications where robustness and adherence to specific directives are paramount.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers due to its focus on enhancing the robustness of LLMs against adversarial threats. By framing the resolution of conflicting prompts as a reasoning challenge, the findings offer novel insights into preventing harmful behaviors triggered by malicious inputs. The results demonstrate a clear potential for improving safety mechanisms through the introduction of explicit instruction prioritization, which could be critical in enhancing the resilience of LLMs in the face of sophisticated attacks, thereby ensuring safer interactions between users and AI systems.

📚 Read the Full Paper