← Back to Library

Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Authors: Xiangfang Li, Yu Wang, Bo Li

Published: 2025-10-01

arXiv ID: 2510.01342v1

Added to Library: 2025-10-03 04:03 UTC

Red Teaming

📄 Abstract

With the rapid advancement of large language models (LLMs), ensuring their safe use becomes increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.

🔍 Key Points

  • The paper introduces a three-pronged jailbreak attack strategy specifically designed for fine-tuning large language models (LLMs) in a highly constrained black-box setting, focusing on evading defenses implemented by service providers.
  • The proposed attack employs prefix/suffix wrappers, benign lexical encodings (underscoring) of harmful keywords, and a trigger-based backdoor approach to effectively learn harmful behaviors while appearing innocuous in the dataset.
  • Experiments demonstrate the approach's effectiveness, achieving over 97% attack success rates when targeting GPT-4 models through real-world fine-tuning APIs, while maintaining the general utility of the LLMs.
  • The authors provide a detailed threat model and experimental setup that realistically represents the constraints of commercial fine-tuning environments, enhancing the relevance of their findings.
  • The study identifies significant gaps in current defense mechanisms against fine-tuning attacks, suggesting a need for improved end-to-end safety measures for model providers.

💡 Why This Paper Matters

This paper highlights a critical vulnerability in the safety alignment of fine-tuned models by demonstrating how attackers can effectively exploit these models even with robust filtering and auditing processes in place. The findings emphasize the need for more resilient defensive strategies for large language models, making it a vital contribution to the current discourse on AI safety.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly interesting as it not only uncovers existing vulnerabilities in widely used large language models but also provides a practical attack framework that challenges current defense methodologies. The insights gained from this research could inform the development of more robust countermeasures, thus advancing the overall security posture of AI systems.

📚 Read the Full Paper