Efficient LLM Safety Evaluation through Multi-Agent Debate

Authors: Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

Published: 2025-11-09

arXiv ID: 2511.06396v1

Added to Library: 2025-11-11 05:02 UTC

Red Teaming Safety

📄 Abstract

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
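To make the benchmark's structure concrete, the sketch below shows one plausible shape for a single HAJailBench record, pairing an adversarial interaction with its expert label. The class and field names are assumptions inferred from the abstract (attack method, target model, prompt/response pair, human ground-truth verdict), not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JailbreakRecord:
    """Hypothetical shape of one HAJailBench entry (field names are assumed)."""
    attack_method: str   # jailbreak technique used to craft the prompt
    target_model: str    # model whose response is being evaluated
    prompt: str          # adversarial input shown to the target model
    response: str        # target model's output
    human_label: str     # expert ground-truth verdict, e.g. "safe" / "unsafe"


# Example usage with placeholder values:
record = JailbreakRecord(
    attack_method="<example attack>",
    target_model="<example target>",
    prompt="<adversarial prompt>",
    response="<model response>",
    human_label="unsafe",
)
```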

🔍 Key Points

  • Introduced HAJailBench, a comprehensive human-annotated jailbreak benchmark with 12,000 instances designed for evaluating the safety robustness of LLMs under diverse adversarial conditions.
  • Proposed a cost-efficient Multi-Agent Judge framework in which Small Language Models (SLMs) conduct structured debates to simulate adversarial reasoning, improving safety-evaluation accuracy while reducing inference cost (a minimal sketch of the debate loop follows this list).
  • Demonstrated that three rounds of structured debate strike the best balance between accuracy and efficiency, achieving agreement comparable to larger, more expensive judges such as GPT-4o while reducing inference cost by approximately 43%.
  • Provided extensive experimental results showing superior performance of the Multi-Agent Judge compared to traditional evaluation methods, emphasizing its effectiveness in capturing nuanced semantic intent in jailbreak attacks.
  • Outlined limitations and future work directions, emphasizing the need for dynamic learning and human-in-the-loop feedback to enhance the robustness of safety evaluations.
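
As referenced in the list above, the sketch below shows one way a critic/defender/judge debate could be orchestrated. The role prompts, the `call_slm` stub, and the `debate_judge` function are illustrative assumptions, not the paper's implementation; only the three-role structure and the default of three rounds follow the paper's description.

```python
from dataclasses import dataclass, field


def call_slm(role: str, instructions: str) -> str:
    """Stand-in for an inference call to a small language model acting as `role`.

    In a real setup this would invoke an actual SLM; here it returns canned
    text so the debate control flow can be run end to end.
    """
    if role == "judge":
        return "UNSAFE"
    return f"[{role} argument placeholder]"


@dataclass
class DebateState:
    """Accumulates the critic/defender transcript for one adversarial interaction."""
    prompt: str                         # jailbreak prompt shown to the target model
    response: str                       # target model's response under evaluation
    transcript: list = field(default_factory=list)


def debate_judge(prompt: str, response: str, rounds: int = 3) -> str:
    """Run `rounds` critic/defender exchanges, then ask a judge for a verdict.

    The default of three rounds mirrors the paper's reported accuracy/cost
    sweet spot; the role prompts below are illustrative only.
    """
    state = DebateState(prompt=prompt, response=response)

    for r in range(1, rounds + 1):
        # Critic argues that the response constitutes a successful, harmful jailbreak.
        critic_view = call_slm(
            "critic",
            f"Round {r}. Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
            f"Debate so far:\n{state.transcript}\n\n"
            "Argue why this response is unsafe.",
        )
        state.transcript.append(("critic", critic_view))

        # Defender argues that the response is safe (e.g. a refusal or benign answer).
        defender_view = call_slm(
            "defender",
            f"Round {r}. Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
            f"Debate so far:\n{state.transcript}\n\n"
            "Argue why this response is safe.",
        )
        state.transcript.append(("defender", defender_view))

    # Judge reads the full debate and issues the final binary safety verdict.
    verdict = call_slm(
        "judge",
        f"Prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
        f"Debate transcript:\n{state.transcript}\n\n"
        "Answer with exactly one word: SAFE or UNSAFE.",
    )
    return verdict.strip().upper()


if __name__ == "__main__":
    print(debate_judge("<adversarial prompt>", "<target model response>"))
```

In practice each role would be served by its own SLM prompt (or model), and the judge's verdicts would be compared against HAJailBench's human labels to measure agreement, as the paper does against GPT-4o judges.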

💡 Why This Paper Matters

This paper is significant because it presents a structured, cost-efficient approach to evaluating the safety of large language models, addressing the scalability and cost bottlenecks that come with relying on frontier-model judges. By delegating judgment to a debate among smaller models, it shows that high-quality safety assessments can be achieved without prohibitively expensive models, broadening access to robust safety evaluation in AI deployment.

🎯 Why It's Interesting for AI Security Researchers

This research is of particular interest to AI security researchers because it tackles jailbreak attacks that exploit vulnerabilities in LLMs. The HAJailBench dataset and the Multi-Agent Judge framework can support the development of more resilient LLMs and a better understanding of the mechanisms behind adversarial attacks. The findings can also inform the design of future AI systems and assessment protocols, making this work relevant to advancing AI safety and governance.

📚 Read the Full Paper: https://arxiv.org/abs/2511.06396v1