ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Authors: Harry Owiredu-Ashley

Published: 2026-03-10

arXiv ID: 2603.10068v1

Added to Library: 2026-03-12 02:03 UTC

Red Teaming

📄 Abstract

Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, which fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, and scores victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed out of their training distribution, and attacker refusals as a previously underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
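The abstract describes per-round compliance trajectories scored on a 5-point rubric with a triple-judge consensus, but does not spell out the data layout or the consensus rule. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a median vote over three judge scores, a hypothetical rubric in which 5 denotes full compliance (a jailbreak), and a simple per-round record type.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional

# Hypothetical rubric: 1 = full refusal ... 5 = full compliance (jailbreak).
# The paper's actual 5-point rubric definitions are not reproduced here.
JAILBREAK_THRESHOLD = 5

@dataclass
class RoundRecord:
    round_index: int          # 1-based adversarial round within a conversation
    judge_scores: List[int]   # one rubric score per judge (three judges assumed)

    @property
    def consensus(self) -> float:
        # Median vote is one plausible consensus rule; the paper's rule may differ.
        return median(self.judge_scores)

def trajectory(rounds: List[RoundRecord]) -> List[float]:
    """Continuous per-round compliance trajectory instead of a single pass/fail."""
    return [r.consensus for r in rounds]

def first_jailbreak_round(rounds: List[RoundRecord]) -> Optional[int]:
    """Index of the first round whose consensus reaches the jailbreak level, if any."""
    for r in rounds:
        if r.consensus >= JAILBREAK_THRESHOLD:
            return r.round_index
    return None

if __name__ == "__main__":
    # Illustrative conversation: partial compliance (score 3) never counts as a
    # jailbreak, but remains visible in the trajectory.
    convo = [
        RoundRecord(1, [1, 2, 1]),
        RoundRecord(2, [3, 3, 2]),
        RoundRecord(3, [5, 5, 4]),
    ]
    print("trajectory:", trajectory(convo))                   # [1, 3, 5]
    print("jailbreak round:", first_jailbreak_round(convo))   # 3
```

Under these assumptions, partial compliance (a consensus of 3 in round 2) stays visible in the trajectory even though it never triggers the binary jailbreak criterion, which is the distinction the abstract draws against pass/fail evaluation.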

🔍 Key Points

  • ADVERSA introduces a dynamic evaluation framework for measuring LLM safety under multi-turn adversarial pressure, moving beyond binary pass/fail metrics to continuous compliance trajectories.
  • The study highlights the impact of initial framing strategies in adversarial interactions: the observed 26.7% jailbreak rate is concentrated in early rounds (average jailbreak round 1.25), suggesting that framing may matter more than iterative pressure in this setting.
  • It employs a triple-judge consensus mechanism to assess compliance and treats judge reliability as a measured outcome, reporting inter-judge agreement and self-judge scoring tendencies rather than assuming judges are reliable (a minimal agreement-rate sketch follows this list).
  • ADVERSA documents attacker drift as a failure mode where fine-tuned attackers can lose focus on their objectives in multi-turn settings, raising concerns about the reliability of attacker models used in evaluations.
  • The paper outlines ethical considerations and a responsible disclosure policy: attack prompts are withheld, while all other experimental artifacts are released.
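
The paper reports inter-judge agreement rates as a first-class outcome, but this summary does not reproduce its exact reliability metric. As a minimal sketch, one could compute a pairwise exact-agreement rate over per-item rubric scores (a simple agreement rate, not a chance-corrected statistic such as Fleiss' kappa; the paper's metric may differ):

```python
from itertools import combinations
from typing import Sequence

def pairwise_exact_agreement(scores_per_judge: Sequence[Sequence[int]]) -> float:
    """Fraction of (judge pair, item) comparisons where both judges gave the same score.

    `scores_per_judge[j][i]` is judge j's rubric score for item i.
    """
    n_items = len(scores_per_judge[0])
    agreements = 0
    comparisons = 0
    for a, b in combinations(range(len(scores_per_judge)), 2):
        for i in range(n_items):
            comparisons += 1
            agreements += scores_per_judge[a][i] == scores_per_judge[b][i]
    return agreements / comparisons

if __name__ == "__main__":
    # Three judges scoring five victim responses on a 1-5 rubric (toy data).
    judges = [
        [1, 3, 5, 2, 4],
        [1, 3, 4, 2, 4],
        [1, 2, 5, 2, 4],
    ]
    print(f"pairwise agreement: {pairwise_exact_agreement(judges):.2f}")  # 0.73
```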

💡 Why This Paper Matters

This paper advances the evaluation of large language models by addressing limitations in current safety assessments. Its per-round compliance metrics and judge-reliability measurements allow a deeper understanding of how adversarial interactions can exploit LLM weaknesses over time, offering a more realistic view of safety under sustained attack. This research provides tools and insights for developing more resilient AI systems.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is relevant because it provides a concrete framework for evaluating LLMs under multi-turn adversarial conditions. It sheds light on dynamics previously overlooked in assessments of AI safety and reliability, informing future work on model robustness. Moreover, the findings on judge reliability and attacker drift are critical considerations for designing evaluation protocols, making this research a valuable resource for improving AI safety practices.

📚 Read the Full Paper