โ† Back to Library

BarrierSteer: LLM Safety via Learning Barrier Steering

Authors: Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

Published: 2026-02-23

arXiv ID: 2602.20102v1

Added to Library: 2026-02-24 04:02 UTC

Safety

📄 Abstract

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.

๐Ÿ” Key Points

  • Introduction of BarrierSteer, a framework incorporating learned non-linear safety constraints into the latent space of large language models (LLMs) for enhanced safety during response generation.
  • Utilization of Control Barrier Functions (CBFs) to create a steering mechanism that enables real-time detection and prevention of unsafe trajectories, with provable safety guarantees.
  • Performance evaluation shows that BarrierSteer significantly reduces adversarial success rates without compromising the utility of LLMs, outperforming existing methods, including heuristic alignment strategies.
  • Efficient computational methods for safety enforcement are proposed, enabling the framework to maintain original model parameters while integrating real-time safety mechanisms.
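To make the CBF steering idea above concrete, here is a minimal sketch of a discrete-time CBF safety filter acting on a latent state. The paper learns non-linear barriers; this sketch instead assumes a hypothetical linear barrier h(x) = w·x + b (my simplification, not the authors' model), which admits a closed-form minimal correction to a nominal latent update.

```python
import numpy as np

def h(x, w, b):
    """Barrier value: h(x) >= 0 means the latent state x is 'safe'."""
    return float(w @ x + b)

def cbf_filter(x, u_nom, w, b, alpha=0.5):
    """Minimally adjust the nominal update u_nom so that the next state
    x + u satisfies the discrete CBF condition
        h(x + u) >= (1 - alpha) * h(x),
    i.e. the barrier may decay toward 0 no faster than rate alpha.
    For a linear barrier, the underlying QP has this closed-form
    projection along the barrier gradient w."""
    slack = h(x + u_nom, w, b) - (1.0 - alpha) * h(x, w, b)
    if slack >= 0.0:            # nominal step already satisfies the CBF condition
        return u_nom
    # project onto the constraint boundary along w
    return u_nom + (-slack / (w @ w)) * w

# Toy 2-D latent example (all numbers illustrative)
w = np.array([1.0, 0.0])        # barrier gradient
b = 0.0                         # so h(x) = x[0]
x = np.array([1.0, 0.0])        # h(x) = 1 > 0: currently safe
u_nom = np.array([-2.0, 0.3])   # nominal update would give h(x + u_nom) = -1

u_safe = cbf_filter(x, u_nom, w, b, alpha=0.5)
x_next = x + u_safe
print(h(x_next, w, b))          # 0.5 = (1 - alpha) * h(x): constraint held tightly
```

The filter leaves safe updates untouched and only intervenes when the nominal step would violate the decay condition, which mirrors how BarrierSteer preserves the model's original behavior on benign generations.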

💡 Why This Paper Matters

This paper presents a significant advance in LLM safety by integrating rigorous, theoretically grounded safety mechanisms directly into the model's inference process. This makes it highly relevant for deployment in sensitive environments where adversarial attacks or harmful content generation pose substantial risks.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper valuable because it tackles a central challenge in deploying LLMs: ensuring safety and robustness against adversarial inputs. The novel methods and empirical results presented can guide future research and practical applications, pushing the field of secure AI forward.

📚 Read the Full Paper