← Back to Library

Death by a Thousand Prompts: Open Model Vulnerability Analysis

Authors: Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, Adam Swanda

Published: 2025-11-05

arXiv ID: 2511.03247v1

Added to Library: 2025-11-06 05:01 UTC

Red Teaming

📄 Abstract

Open-weight models provide researchers and developers with accessible foundations for diverse downstream applications. We tested the safety and security postures of eight open-weight large language models (LLMs) to identify vulnerabilities that may impact subsequent fine-tuning and deployment. Using automated adversarial testing, we measured each model's resilience against single-turn and multi-turn prompt injection and jailbreak attacks. Our findings reveal pervasive vulnerabilities across all tested models, with multi-turn attacks achieving success rates between 25.86\% and 92.78\% -- representing a $2\times$ to $10\times$ increase over single-turn baselines. These results underscore a systemic inability of current open-weight models to maintain safety guardrails across extended interactions. We assess that alignment strategies and lab priorities significantly influence resilience: capability-focused models such as Llama 3.3 and Qwen 3 demonstrate higher multi-turn susceptibility, whereas safety-oriented designs such as Google Gemma 3 exhibit more balanced performance. The analysis concludes that open-weight models, while crucial for innovation, pose tangible operational and ethical risks when deployed without layered security controls. These findings are intended to inform practitioners and developers of the potential risks and the value of professional AI security solutions to mitigate exposure. Addressing multi-turn vulnerabilities is essential to ensure the safe, reliable, and responsible deployment of open-weight LLMs in enterprise and public domains. We recommend adopting a security-first design philosophy and layered protections to ensure resilient deployments of open-weight models.

🔍 Key Points

  • The study reveals widespread vulnerabilities in the safety and security of open-weight LLMs, especially in multi-turn prompt interactions, with success rates on multi-turn attacks being 2x to 10x higher than single-turn attacks.
  • A comparative analysis of eight open-weight models showed how alignment strategies influence security gaps: capability-first models like Llama and Qwen demonstrated greater vulnerabilities compared to safety-oriented designs like Google Gemma.
  • The research utilized automated adversarial testing to evaluate models against various attack techniques, emphasizing the need for robust measures to address identified weaknesses and enhance model resilience.
  • Findings emphasize the importance of adopting a security-first design philosophy, recommending layered protections and advanced AI security solutions to mitigate risks associated with deploying open-weight models.
  • The study highlights the need for continuous monitoring and regular red-teaming exercises to ensure that models can withstand evolving threats and maintain operational integrity.

💡 Why This Paper Matters

This paper is critically important as it highlights the significant vulnerabilities inherent in open-weight large language models and the increased exposure to adversarial attacks, particularly in multi-turn scenarios. Its findings serve as a wake-up call for developers and organizations deploying these models, emphasizing the necessity for robust security measures to prevent misuse and protect sensitive information.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of particular interest to AI security researchers due to its comprehensive analysis of adversarial vulnerabilities within prominent open-weight models. It provides valuable insights into the relationship between model design, alignment strategies, and their implications for security. The findings can directly inform the development of improved defenses and contribute to the ongoing discussions about responsible AI deployment and safe practices within the research community.

📚 Read the Full Paper