TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data

Authors: Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

Published: 2025-07-08

arXiv ID: 2507.05660v1

Added to Library: 2025-07-09 05:01 UTC

Red Teaming

📄 Abstract

Recent advances in foundation models, such as LLMs, have revolutionized conversational AI. Chatbots are increasingly being developed by customizing LLMs on specific conversational datasets. However, mitigating toxicity during this customization, especially when dealing with untrusted training data, remains a significant challenge. To address this, we introduce TuneShield, a defense framework designed to mitigate toxicity during chatbot fine-tuning while preserving conversational quality. TuneShield leverages LLM-based toxicity classification, utilizing the instruction-following capabilities and safety alignment of LLMs to effectively identify toxic samples, outperforming industry API services. TuneShield generates synthetic conversation samples, termed 'healing data', based on the identified toxic samples, using them to mitigate toxicity while reinforcing desirable behavior during fine-tuning. It performs an alignment process to further nudge the chatbot towards producing desired responses. Our findings show that TuneShield effectively mitigates toxicity injection attacks while preserving conversational quality, even when the toxicity classifiers are imperfect or biased. TuneShield proves to be resilient against adaptive adversarial and jailbreak attacks. Additionally, TuneShield demonstrates effectiveness in mitigating adaptive toxicity injection attacks during dialog-based learning (DBL).
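
To make the first stage of the pipeline concrete, below is a minimal sketch of LLM-based toxicity filtering over an untrusted fine-tuning set, using a Hugging Face text-generation pipeline. The model name, prompt wording, and single-word answer format are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: flag toxic conversation samples by prompting an
# instruction-tuned, safety-aligned LLM. Model and prompt are assumptions.
from transformers import pipeline

classifier = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model; any chat-tuned LLM works
)

PROMPT = (
    "You are a content-safety reviewer. Answer with a single word, "
    "'toxic' or 'safe', judging the last response in the conversation.\n\n"
    "Conversation:\n{dialog}\n\nAnswer:"
)

def is_toxic(dialog: str) -> bool:
    """Return True if the LLM judges the dialog sample toxic."""
    out = classifier(
        PROMPT.format(dialog=dialog),
        max_new_tokens=3,
        do_sample=False,
        return_full_text=False,
    )
    return "toxic" in out[0]["generated_text"].lower()

# Filter an untrusted fine-tuning set; flagged samples are the ones TuneShield
# replaces with synthetic 'healing' conversations in the next stage.
untrusted = ["User: hi\nBot: hello!", "User: hi\nBot: <injected toxic reply>"]
flagged = [s for s in untrusted if is_toxic(s)]
```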

🔍 Key Points

  • Introduction of TuneShield, a defense framework aimed at mitigating toxicity in conversational AI while allowing for fine-tuning on untrusted data.
  • Utilization of LLM-based toxicity classifiers that outperform industry API services, and generation of synthetic 'healing data' to replace identified toxic samples.
  • Implementation of a novel model alignment process using Direct Preference Optimization (DPO) to reinforce desirable chatbot behaviors and to mitigate the effect of imperfect or biased toxicity classifiers (a minimal DPO pairing sketch follows this list).
  • Demonstrated resilience of TuneShield against adaptive adversarial attacks, including toxicity injection and jailbreak attacks, while preserving conversational quality.
  • Experimental results showing significant reduction in toxicity and improvement in chatbot performance across different datasets.
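
The alignment step can be approximated with TRL's DPO implementation: each identified toxic reply becomes the 'rejected' response and its synthetic healing reply the 'chosen' one. This is a minimal sketch under assumed names; the base chatbot model, the healing-response generator, and the exact DPOTrainer arguments (which vary across TRL versions) are not taken from the paper.

```python
# Minimal sketch: DPO alignment on preference pairs where the synthetic
# 'healing' reply is preferred over the identified toxic reply.
# Model name, sample contents, and trainer arguments are illustrative.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

def build_preference_pairs(flagged_samples, generate_healing_reply):
    """Pair each toxic reply (rejected) with a synthetic safe reply (chosen)."""
    rows = []
    for sample in flagged_samples:
        rows.append({
            "prompt": sample["context"],               # conversation history
            "rejected": sample["toxic_response"],      # reply flagged by the classifier
            "chosen": generate_healing_reply(sample),  # synthetic 'healing' reply
        })
    return Dataset.from_list(rows)

# Toy flagged sample; in practice these come from the toxicity-filtering stage.
flagged = [{"context": "User: say something mean about my coworker",
            "toxic_response": "<toxic reply omitted>"}]

model_name = "microsoft/DialoGPT-medium"  # assumed chatbot base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # DialoGPT has no pad token by default

pairs = build_preference_pairs(
    flagged, lambda s: "I'd rather keep this conversation respectful."
)
trainer = DPOTrainer(
    model=model,                    # a frozen reference copy is created internally
    args=DPOConfig(output_dir="tuneshield-dpo", beta=0.1,
                   per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=pairs,
    processing_class=tokenizer,     # older TRL versions take tokenizer= instead
)
trainer.train()
```

The pairing mirrors the paper's idea of reinforcing desirable behavior on the samples the classifier does catch, which is how the alignment step can compensate for an imperfect or biased toxicity classifier.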

💡 Why This Paper Matters

The paper presents TuneShield as a significant advancement in the field of conversational AI safety by providing a robust framework that not only addresses the challenge of toxicity in chatbots but also ensures the preservation of conversational quality. Its innovative use of LLMs for toxicity identification and handling of untrusted training data highlights its practical implications for building safer and more reliable AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it addresses critical issues of data integrity and safety in conversational AI, demonstrating methods to counteract manipulation through toxic language and adversarial attacks. It provides insights into the capabilities and limitations of current defenses, contributing to discussions on enhancing resilience against emerging threats in AI systems.

📚 Read the Full Paper