
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao

Published: 2026-01-30

arXiv ID: 2601.22636v1

Added to Library: 2026-02-03 08:07 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline: an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
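The Beta-Bernoulli idea behind this kind of extrapolation can be sketched in a few lines. If the per-sample success probability p of a prompt is drawn from a Beta(α, β) distribution, then the chance that at least one of N independent samples succeeds is 1 − E[(1−p)^N] = 1 − B(α, β+N)/B(α, β), which can be evaluated for any N once α and β are fit from small-budget measurements. The code below is a minimal illustration of that closed form, not the paper's SABER estimator or its anchoring procedure; the method-of-moments fit, the function names, and the example success rates are all illustrative assumptions.

```python
import math

def log_beta(a: float, b: float) -> float:
    # ln B(a, b) computed via log-gamma for numerical stability.
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def asr_at_n(alpha: float, beta: float, n: int) -> float:
    # Under p ~ Beta(alpha, beta), the probability that at least one of
    # n independent samples succeeds:
    #   1 - E[(1-p)^n] = 1 - B(alpha, beta + n) / B(alpha, beta)
    return 1.0 - math.exp(log_beta(alpha, beta + n) - log_beta(alpha, beta))

def fit_beta_moments(rates: list[float]) -> tuple[float, float]:
    # Method-of-moments fit of Beta(alpha, beta) to per-prompt empirical
    # success rates measured at a small sampling budget (assumed fit;
    # the paper may use a different estimator).
    m = sum(rates) / len(rates)
    v = sum((r - m) ** 2 for r in rates) / len(rates)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Hypothetical per-prompt success rates, each estimated from n=100 samples.
rates = [0.01, 0.02, 0.0, 0.05, 0.03, 0.01, 0.0, 0.02]
alpha, beta = fit_beta_moments(rates)
# Even with a mean per-sample success rate below 2%, the extrapolated
# ASR@1000 can be dramatically higher, illustrating nonlinear risk
# amplification under parallel sampling.
print(f"ASR@1   ~= {asr_at_n(alpha, beta, 1):.3f}")
print(f"ASR@1000 ~= {asr_at_n(alpha, beta, 1000):.3f}")
```

Note how ASR@1 recovers the Beta mean α/(α+β), while ASR@N grows monotonically in N, which is what makes small-budget measurements informative about large-budget risk under this model.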

🔍 Key Points

  • Proposes the scaling-aware Best-of-N estimation of risk (SABER) for assessing adversarial risks in Large Language Models (LLMs) under realistic conditions.
  • Models sample-level success probabilities using a Beta distribution, providing a statistically sound way to predict adversarial success rates across varying budgets.
  • Demonstrates that traditional single-shot evaluations can significantly underestimate adversarial risk, so models deployed on that basis may be far less safe under realistic attack budgets than benchmarks suggest.
  • Achieves a significant reduction in estimation error (86.2%) compared to baseline methods, showcasing the effectiveness of the SABER approach using only a small number of samples (n=100).
  • Reveals heterogeneous risk scaling profiles, indicating that certain models thought to be robust can exhibit rapid risk amplification under parallel adversarial pressure.

💡 Why This Paper Matters

This paper provides a critical new framework for evaluating the safety and robustness of Large Language Models against adversarial attacks. Its novel scaling-aware approach enables more accurate predictions of adversarial risk in real-world scenarios, which is essential as LLMs become increasingly integrated into applications with significant social impact. By addressing the limitations of existing evaluation methodologies, this research lays the groundwork for more reliable assessments of LLM safety and contributes to the development of robust AI systems for future applications.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting as it tackles the pressing issue of adversarial attacks on AI systems, specifically Large Language Models. The introduction of the SABER framework represents a significant advancement in understanding and forecasting adversarial risks, which is crucial for enhancing the safety measures implemented in AI deployment. By providing concrete methods and empirical results, this research equips security professionals with novel techniques to proactively evaluate and mitigate risks associated with evolving adversarial tactics.
