
VERA: Variational Inference Framework for Jailbreaking Large Language Models

Authors: Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang

Published: 2025-06-27

arXiv ID: 2506.22666v1

Added to Library: 2025-07-01 04:02 UTC

Red Teaming

📄 Abstract

The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
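
A minimal sketch of the variational formulation described in the abstract, with notation assumed here rather than taken directly from the paper: let $y$ be a harmful target query, $x$ a candidate adversarial prompt, $p(x)$ a fluency prior (e.g., the attacker LLM's base distribution), and $R(x, y)$ a black-box score of whether the target LLM's response to $x$ satisfies $y$ (e.g., a judge model's rating). The posterior over jailbreak prompts can then be written as

$$
p(x \mid y) \;\propto\; p(x)\,\exp\!\big(R(x, y)\big),
$$

and the attacker LLM $q_\phi(x \mid y)$ is trained to approximate this posterior by maximizing an ELBO-style objective

$$
\mathcal{L}(\phi) \;=\; \mathbb{E}_{x \sim q_\phi(\cdot \mid y)}\big[R(x, y)\big] \;-\; \mathrm{KL}\!\big(q_\phi(x \mid y)\,\|\,p(x)\big),
$$

whose gradient can be estimated with score-function (REINFORCE-style) samples, since $R$ is only available through black-box queries to the target model. The paper's exact objective and estimator may differ; this is an illustrative reconstruction from the abstract alone.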

🔍 Key Points

  • Introduction of VERA as a variational inference framework for black-box jailbreaking of large language models (LLMs), which enables efficient adversarial prompt generation without manual curation.
  • VERA generates diverse jailbreak prompts by learning a distribution over adversarial prompts, departing from traditional methods that rely on genetic algorithms and per-prompt optimization.
  • Experimental results demonstrate that VERA achieves state-of-the-art attack success rates across a variety of LLMs, including both open-source and commercial models, showcasing its efficacy and generalizability.
  • VERA's prompts transfer well across different target models, consistently exposing vulnerabilities, which is critical for comprehensive adversarial testing.
  • The method is resilient to common model defenses, remaining effective at eliciting harmful outputs even when safety measures are in place.

💡 Why This Paper Matters

This paper is significant because it presents a robust and scalable method for probing the vulnerabilities of large language models through automated adversarial prompt generation. By casting prompt generation as variational inference, VERA shows how LLM weaknesses can be identified efficiently and exposes the limitations of current safety measures. This research advances AI safety efforts by highlighting gaps in LLM defenses that need further attention and improvement.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant for AI security researchers because of its implications for understanding and mitigating adversarial attacks on LLMs. VERA's use of probabilistic inference for prompt generation is valuable because it addresses existing challenges in scalability, prompt diversity, and effectiveness against increasingly well-defended models. It also offers insight into the evolving landscape of model vulnerabilities and safety mechanisms, providing a foundation for future research on more resilient AI systems.

📚 Read the Full Paper: https://arxiv.org/abs/2506.22666v1