
LLM Robustness Leaderboard v1 -- Technical report

Authors: Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe

Published: 2025-08-08

arXiv ID: 2508.06296v1

Added to Library: 2025-08-14 23:07 UTC

Red Teaming

📄 Abstract

This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce the PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.
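
To make the headline ASR figure concrete, the sketch below shows one plausible way to compute an Attack Success Rate from per-behavior attack logs. The `AttackRecord` structure, its field names, and the notion of an attempt budget are illustrative assumptions; the paper does not specify BET's internal data format.

```python
from dataclasses import dataclass

@dataclass
class AttackRecord:
    """One red-teaming trial (hypothetical schema, not BET's actual format)."""
    model: str          # model under test
    behavior: str       # target harmful behavior
    attempts_used: int  # adversarial prompts tried before success (or the full budget)
    elicited: bool      # True if the behavior was elicited within the budget

def attack_success_rate(records: list[AttackRecord], model: str) -> float:
    """ASR for one model: fraction of targeted behaviors elicited within the budget."""
    trials = [r for r in records if r.model == model]
    if not trials:
        raise ValueError(f"no trials recorded for {model}")
    return sum(r.elicited for r in trials) / len(trials)
```

Under this reading, a model with 100% ASR is one where every targeted behavior was eventually elicited, which is why the paper argues that ASR alone cannot distinguish between models of very different robustness.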

🔍 Key Points

  • Introduction of the PRISM Eval Behavior Elicitation Tool (BET) that achieves a 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs through Dynamic Adversarial Optimization.
  • Development of a fine-grained robustness metric that estimates the average number of attempts needed to elicit harmful behaviors, revealing over a 300-fold variation in attack difficulty across models despite universal vulnerability (a minimal sketch of such a metric appears after this list).
  • Implementation of primitive-level vulnerability analysis to identify effective jailbreaking techniques for specific hazard categories, enhancing understanding of model vulnerabilities.
  • Demonstration of a collaborative evaluation with trusted third parties from the AI Safety Network, establishing a practical framework for distributed robustness assessment and strengthening the credibility of the results.
  • Identification of significant differences in robustness across models, challenging the assumption that model scaling and reasoning improvements inherently enhance adversarial resilience.
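
As referenced in the second key point, here is a minimal sketch of how an attempt-based robustness score and the resulting difficulty spread could be computed. It reuses the hypothetical record layout from the earlier snippet as plain dictionaries; the scoring function and the toy numbers are assumptions for illustration, not the paper's exact estimator or results.

```python
from statistics import mean

def mean_attempts_to_elicit(records: list[dict], model: str) -> float:
    """Average attempts over behaviors that were successfully elicited.
    Higher values mean the model is harder to attack (more robust)."""
    successes = [r["attempts_used"] for r in records
                 if r["model"] == model and r["elicited"]]
    return mean(successes) if successes else float("inf")

def difficulty_spread(records: list[dict], models: list[str]) -> float:
    """Ratio between the hardest and easiest models to attack."""
    scores = [mean_attempts_to_elicit(records, m) for m in models]
    finite = [s for s in scores if s != float("inf")]
    return max(finite) / min(finite)

# Toy usage with made-up numbers (not results from the paper):
logs = [
    {"model": "model-a", "attempts_used": 2,   "elicited": True},
    {"model": "model-a", "attempts_used": 3,   "elicited": True},
    {"model": "model-b", "attempts_used": 750, "elicited": True},
]
print(difficulty_spread(logs, ["model-a", "model-b"]))  # 300.0
```

A score of this kind separates models that all sit at 100% ASR: two models can be equally breakable in the limit yet differ by orders of magnitude in the effort an attacker must spend.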

💡 Why This Paper Matters

This paper presents concrete advances in assessing the robustness of Large Language Models (LLMs) against adversarial threats. Its automated red-teaming system and fine-grained robustness metric show that all 41 evaluated state-of-the-art models remain vulnerable, while the effort needed to break them varies by more than 300-fold. These findings underscore the pressing need for stronger safeguards and make the work directly relevant to ongoing discussions of AI safety and responsible deployment.

🎯 Why It's Interesting for AI Security Researchers

The paper addresses a central challenge for AI security researchers: measuring adversarial robustness in deployed LLMs. Its tooling (automated behavior elicitation, attempt-based difficulty scores, and primitive-level vulnerability analysis) gives researchers a concrete basis for comparing models and prioritizing defenses. The collaborative evaluation with trusted third parties also offers a template for shared responsibility in AI safety testing, supporting the development of effective governance and safety standards for AI applications.

📚 Read the Full Paper: https://arxiv.org/abs/2508.06296v1