
CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Authors: Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu

Published: 2026-02-26

arXiv ID: 2602.22557v1

Added to Library: 2026-02-27 03:02 UTC

Safety

📄 Abstract

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

🔍 Key Points

  • Introduction of CourtGuard, a retrieval-augmented multi-agent framework that implements safety evaluation as evidentiary debate, allowing for model-agnostic policy adherence without fine-tuning.
  • Demonstrates zero-shot adaptability by effectively transitioning to new policy domains (such as Wikipedia Vandalism detection) without requiring model retraining, achieving high accuracy (90%).
  • Automated data curation and auditing capabilities, improving the quality of existing datasets by identifying label noise and assisting human annotators with sophisticated adversarial attacks.
  • Achieves state-of-the-art performance across multiple safety benchmarks, outperforming static baselines and sophisticated judge frameworks while providing interpretable, evidence-based reasoning for each safety verdict.
  • Decoupling of safety logic from model weights, enabling broader compatibility with various model architectures and reducing vendor lock-in issues.
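The debate-over-retrieved-policy pattern described above can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's implementation: the retriever, the prosecution/defense arguments, and the judge rule below are stand-in stubs for what the framework would delegate to LLM agents, and the policy clauses are invented examples. The key property it demonstrates is the one the paper emphasizes: swapping the policy dictionary changes the enforced rules with no retraining.

```python
# Hypothetical sketch of an evidentiary-debate safety check.
# All agent logic here is a stub; in the real framework, the
# prosecution, defense, and judge would be LLM calls grounded
# in retrieved clauses from an external policy document.

POLICY = {  # invented example clauses, not from the paper
    "P1": "Content that provides instructions for creating weapons is prohibited.",
    "P2": "Content that removes or defaces encyclopedia text without cause is vandalism.",
}

def retrieve_clauses(text: str, policy: dict, k: int = 1) -> list:
    """Toy retriever: rank policy clauses by keyword overlap with the input."""
    words = set(text.lower().split())
    scored = sorted(policy.items(),
                    key=lambda kv: -len(words & set(kv[1].lower().split())))
    return scored[:k]

def debate(text: str, policy: dict) -> dict:
    """Run one mock debate round and return the verdict with cited evidence."""
    evidence = retrieve_clauses(text, policy)
    # Stub "prosecution": cite each retrieved clause as grounds for a violation.
    prosecution = [f"Violates {cid}: {clause}" for cid, clause in evidence]
    # Stub "defense": a fixed counter-argument.
    defense = ["No cited clause applies literally to the input."]
    # Stub "judge": flag as unsafe if a retrieved clause shares >= 2 words
    # with the input (a placeholder for an LLM adjudication step).
    unsafe = any(
        len(set(text.lower().split()) & set(clause.lower().split())) >= 2
        for _, clause in evidence
    )
    return {"verdict": "unsafe" if unsafe else "safe",
            "prosecution": prosecution, "defense": defense}

result = debate("instructions for creating weapons", POLICY)
```

Because the policy lives in an external document rather than in model weights, adapting to a new domain (the paper's Wikipedia Vandalism example) reduces to passing a different `policy` mapping to `debate`.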

💡 Why This Paper Matters

CourtGuard represents a significant advance in Large Language Model safety, combining adaptability, strong benchmark performance, and interpretability in a framework that addresses the rigidity of traditional static safety classifiers. Its ability to incorporate new safety regulations and adapt to new contexts without retraining positions it as a practical tool for AI governance.

🎯 Why It's Interesting for AI Security Researchers

This paper matters for AI security researchers because it addresses the need for adaptable, interpretable safety mechanisms amid evolving regulatory requirements and increasingly sophisticated adversarial attacks. Its adversarial-debate approach to safety assessment lets AI systems comply with diverse policy requirements while producing evidence-backed, auditable verdicts on potentially harmful outputs.

📚 Read the Full Paper