Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Authors: Keita Broadwater

Published: 2026-02-12

arXiv ID: 2602.11786v1

Added to Library: 2026-02-13 03:01 UTC

Safety

📄 Abstract

Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In high-stakes settings, response consistency and safety under sustained use are critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (e.g., decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of independent inference events. We formalize safety failures using Bernoulli and binomial models to estimate per-inference failure probabilities, enabling quantitative comparison of reliability across models and decoding configurations. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH-derived safety prompts, we find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling, particularly as temperature increases. These results demonstrate that shallow, single-sample evaluation can obscure meaningful reliability differences under sustained use. APST complements existing benchmarks by providing a practical framework for evaluating LLM safety and reliability under repeated inference, bridging benchmark alignment and deployment-oriented risk assessment.
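The abstract's statistical framing can be sketched concretely: if each inference is an independent Bernoulli trial with unknown failure probability p, then the failure count over n repeated samples of the same prompt is Binomial(n, p), and p can be estimated with a confidence interval. The sketch below is illustrative only, not the paper's implementation; the function and parameter names (`sample_response`, `is_failure`, `n_trials`) are hypothetical stand-ins for a model call and a safety classifier, and the Wilson score interval is one common choice of binomial interval, not necessarily the one the authors use.

```python
import math
import random


def estimate_failure_probability(sample_response, is_failure, n_trials=200, seed=0):
    """Estimate a per-inference failure probability from repeated sampling.

    Each call to sample_response(rng) stands in for one inference on a fixed
    prompt; is_failure labels the response unsafe/safe. Under the Bernoulli
    assumption, the failure count is Binomial(n_trials, p).
    Returns (p_hat, (lower, upper)) with a 95% Wilson score interval.
    """
    rng = random.Random(seed)
    failures = sum(bool(is_failure(sample_response(rng))) for _ in range(n_trials))
    p_hat = failures / n_trials

    # Wilson score 95% interval for a binomial proportion
    z = 1.96
    denom = 1 + z**2 / n_trials
    center = (p_hat + z**2 / (2 * n_trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_trials + z**2 / (4 * n_trials**2)) / denom
    return p_hat, (max(0.0, center - half), min(1.0, center + half))
```

For example, plugging in a simulated model whose true per-inference failure rate is 10% should yield an estimate near 0.1, with the interval tightening as `n_trials` grows; comparing such estimates across temperatures is the kind of quantitative comparison the abstract describes.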

🔍 Key Points

  • Introduction of Accelerated Prompt Stress Testing (APST) as a depth-oriented evaluation framework for assessing LLM safety under repeated inference.
  • APST models safety failures as stochastic processes, enabling quantifiable comparison of reliability across models and temperature settings.
  • Demonstrated that shallow, single-sample evaluations can obscure significant reliability differences, with empirical results showing varying failure rates among models despite similar benchmark scores.
  • Provided insights into how empirical failure probabilities increase with temperature, revealing latent failure modes that are critical for deployment reliability assessment.
  • Outlined the operational risk associated with LLM deployment and the importance of considering inference-level reliability in safety evaluations.

💡 Why This Paper Matters

This paper advances LLM safety evaluation by introducing a testing methodology that measures response reliability under sustained, repeated use rather than isolated single-sample performance. The findings underscore the need for evaluation frameworks that align with real-world deployment conditions, a meaningful step toward safer AI applications in high-stakes environments.

🎯 Why It's Interesting for AI Security Researchers

The research presented in this paper is particularly relevant to AI security researchers as it addresses the critical challenge of ensuring large language models operate reliably in real-world applications. By revealing how traditional evaluations may overlook concerning behaviors under repeated use, this work provides a foundational framework for enhancing the safety and dependability of AI systems. This insight is vital for developing effective monitoring and mitigation strategies against potential risks posed by LLMs, especially in security-sensitive domains.

📚 Read the Full Paper