
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Authors: Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

Published: 2026-02-07

arXiv ID: 2602.07340v1

Added to Library: 2026-02-10 03:04 UTC

Safety

📄 Abstract

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective.

🔍 Key Points

  • Introduction of ShaPO (Sharpness-aware Preference Optimization), a geometry-aware preference optimization framework designed to enhance the safety alignment of LLMs.
  • Emphasis on selective geometry control to optimize only the safety-critical parameter subspace, mitigating the negative effects of uniform geometry constraints.
  • Demonstration of ShaPO's effectiveness across diverse safety benchmarks, showing significant improvements in robustness against domain shifts and noisy preference supervision compared to standard algorithms like DPO and its variants.
  • Demonstration of composability: ShaPO integrates with existing data-centric robust objectives, yielding additional robustness gains without degrading performance.
  • Presentation of empirical results supporting the idea that robustness in LLM safety alignment should consider both data uncertainty and optimization geometry.
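The selective-geometry idea in the key points above can be sketched as a SAM-style two-step update on a DPO-like preference loss, where the worst-case perturbation is restricted to a masked "alignment-critical" parameter subspace. This is a minimal illustrative sketch under simplifying assumptions (a linear reward margin, a hand-chosen mask); the function names and hyperparameters are invented for exposition and are not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pref_loss(w, x_w, x_l, beta=1.0):
    # DPO-style loss: -log sigmoid(beta * (reward margin of chosen over rejected))
    return -np.log(sigmoid(beta * (w @ x_w - w @ x_l)))

def pref_loss_grad(w, x_w, x_l, beta=1.0):
    # Gradient of the loss above w.r.t. w
    margin = beta * (w @ x_w - w @ x_l)
    return -beta * (1.0 - sigmoid(margin)) * (x_w - x_l)

def selective_sam_step(w, x_w, x_l, mask, rho=0.05, lr=0.1):
    """One sharpness-aware update with the perturbation confined to `mask`."""
    g = pref_loss_grad(w, x_w, x_l)
    g_sel = g * mask                          # keep only the critical subspace
    w_adv = w + rho * g_sel / (np.linalg.norm(g_sel) + 1e-12)  # ascend to local worst case
    g_adv = pref_loss_grad(w_adv, x_w, x_l)   # gradient at the perturbed point
    return w - lr * g_adv                     # descend using the worst-case gradient
```

Because the ascent step only perturbs masked coordinates, parameters outside the critical subspace see an ordinary gradient update, which is how a selective scheme avoids the over-regularization that a uniform flatness constraint would impose.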

💡 Why This Paper Matters

This paper addresses a critical gap in the safety alignment of large language models by examining the interplay between optimization geometry and robustness. ShaPO provides not only a novel methodological framework but also measurable improvements in aligning model behavior with safety standards, particularly under domain shift and noisy supervision. As LLMs spread into ever more domains, ensuring the reliability and safety of their responses becomes paramount, making this research especially relevant and timely.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper important because it expands on existing robust alignment methods by introducing an innovative approach that accounts for optimization-induced fragility. The findings can inform the design of more resilient AI systems capable of resisting adversarial attacks and domain shifts, thereby enhancing the overall security and reliability of LLMs in real-world applications.

📚 Read the Full Paper