← Back to Library

Reasoning Up the Instruction Ladder for Controllable Language Models

Authors: Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Published: 2025-10-30

arXiv ID: 2511.04694v3

Added to Library: 2025-12-03 03:01 UTC

Red Teaming

📄 Abstract

As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.

🔍 Key Points

  • The paper proposes a novel method for managing competing instructions in Large Language Models (LLMs) by establishing an instruction hierarchy (IH) where higher-priority inputs take precedence over user prompts.
  • The introduction of the VerIH dataset, which consists of approximately 7K constraint-following tasks, allows for training models specifically on instruction hierarchy reasoning, contributing to more robust instruction compliance.
  • The methodology includes lightweight reinforcement learning to enhance the model's reasoning capabilities, resulting in significant improvements (around 20%) in instruction following and hierarchy benchmarks such as IHEval.
  • Tested models demonstrated greater resilience to safety threats like jailbreak and prompt injection attacks, showing a 20% reduction in attack success rates (ASR) under conditions that typically challenge model security.
  • Findings indicate that explicit reasoning over instruction hierarchies not only enhances controllability but also generalizes to out-of-distribution safety settings without requiring additional safety training.

💡 Why This Paper Matters

This paper presents a vital leap in the development of controllable and reliable LLMs by addressing the complex challenge of reconciling conflicting instructions. By framing instruction hierarchy as a reasoning task and developing a targeted dataset and training approach, it establishes a robust foundation for future AI models to interact in safety-critical contexts effectively. The insights gained are crucial in moving towards safer, more dependable AI applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly relevant as it tackles the significant issue of instruction compliance and robustness in LLMs, an area of rising concern given the increasing deployment of AI systems in sensitive applications. The research provides innovative methods for improving model behavior in adversarial settings, which is essential for enhancing the overall safety and security of AI technologies.

📚 Read the Full Paper