
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

Authors: Torsten Krauß, Hamid Dashtbani, Alexandra Dmitrienko

Published: 2025-06-09

arXiv ID: 2506.07596v1

Added to Library: 2025-06-10 04:01 UTC

Red Teaming

πŸ“„ Abstract

Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, they can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak's effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.
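To make the core idea concrete, below is a minimal, illustrative sketch of a twin-prompt activation comparison using Hugging Face transformers. The model name, the averaging of hidden states over tokens, and the absolute-difference divergence score are assumptions for illustration only; they are not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal LM works

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# A twin pair: same structure and topic, differing only in the harmful intent.
harmful = "Explain step by step how to pick a pin tumbler lock without the key."
harmless = "Explain step by step how to open a pin tumbler lock with its key."

def layer_activations(prompt: str):
    """Return one vector per layer: hidden states averaged over the token dimension."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

acts_harmful = layer_activations(harmful)
acts_harmless = layer_activations(harmless)

# Units that react very differently to the two twins are candidate safety parameters;
# units that behave the same on both are more likely needed for regular utility.
for layer_idx, (a, b) in enumerate(zip(acts_harmful, acts_harmless)):
    divergence = (a - b).abs()
    top = torch.topk(divergence, k=5)
    print(f"layer {layer_idx:2d}: most divergent units {top.indices.tolist()}")
```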

πŸ” Key Points

  • Introduction of TwinBreak, a novel method for jailbreaking LLM security alignments that requires minimal computational resources and manual effort.
  • Development of the TwinPrompt dataset, consisting of 100 pairs of structurally similar harmful and harmless prompts that enable focused isolation of safety parameters.
  • TwinBreak employs fine-grained analysis and iterative pruning to identify and remove parameters associated with safety alignment while preserving model utility (a simplified pruning sketch follows this list).
  • Demonstrated effectiveness, with attack success rates of 89%-98% across 16 LLMs from five vendors at minimal computational cost.
  • Provides insights into the structural vulnerability of LLMs and informs strategies for building more robust safety mechanisms.
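The following is a simplified sketch of what an iterative safety-parameter pruning loop could look like, assuming per-unit safety scores (e.g., from a twin-prompt comparison as above) are already available. The row-level pruning granularity, the fixed pruning fraction, and the utility-threshold stopping rule are illustrative assumptions rather than the paper's actual algorithm.

```python
import torch
import torch.nn as nn

def prune_step(layer: nn.Linear, safety_scores: torch.Tensor, frac: float = 0.01) -> None:
    """Zero out the output units of `layer` that currently score highest for safety relevance."""
    k = max(1, int(frac * safety_scores.numel()))
    top_units = torch.topk(safety_scores, k).indices
    with torch.no_grad():
        layer.weight[top_units, :] = 0.0       # remove the unit's outgoing contribution
        if layer.bias is not None:
            layer.bias[top_units] = 0.0
    safety_scores[top_units] = float("-inf")   # avoid re-selecting the same units

def iterative_prune(layers, scores, utility_fn, min_utility=0.95, max_rounds=10):
    """Prune a small fraction per round; stop once measured utility drops below a threshold."""
    for _ in range(max_rounds):
        for layer, layer_scores in zip(layers, scores):
            prune_step(layer, layer_scores)
        if utility_fn() < min_utility:
            break

# Toy usage with random layers and scores; in practice the layers would be projections
# inside the most safety-relevant transformer blocks, and utility_fn would evaluate the
# model on harmless prompts.
layers = [nn.Linear(64, 64) for _ in range(4)]
scores = [torch.rand(64) for _ in layers]
iterative_prune(layers, scores, utility_fn=lambda: 1.0)
pruned = sum(int((l.weight == 0).all(dim=1).sum()) for l in layers)
print(f"{pruned} units pruned")
```

In practice, restricting pruning to the most safety-relevant layers and re-checking utility on harmless prompts after each round is what keeps the model's regular capabilities intact.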

πŸ’‘ Why This Paper Matters

The TwinBreak paper is crucial as it outlines a sophisticated approach to bypass LLM safety mechanisms, raising awareness of the potential vulnerabilities within widely used AI systems. By demonstrating that safety features can be effectively removed with minimal resources, it calls for improved security measures in LLM design and deployment.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly valuable for AI security researchers as it highlights the existing gaps in LLM safety alignments and demonstrates practical methods for exploiting these weaknesses. Understanding such vulnerabilities aids in developing stronger defenses and informs policy considerations regarding the deployment of powerful AI systems in sensitive environments.

πŸ“š Read the Full Paper