Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Authors: Nazanin Mohammadi Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Daniel M. Roy, Gintare Karolina Dziugaite

Published: 2026-02-24

arXiv ID: 2602.23391v1

Added to Library: 2026-03-02 03:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.

🔍 Key Points

  • Introduction of Representation Erasure-based Preference Optimization (REPO) as a novel method for detoxifying large language models by directly targeting internal representations rather than just outputs.
  • Mechanistic analysis demonstrating that REPO effectively modifies toxicity-encoding neurons while preserving general model utility, leading to significant improvements in robustness against adversarial attacks like relearning and enhanced jailbreaks.
  • Evaluation across various datasets showing that REPO outperforms state-of-the-art detoxification methods in reducing toxicity while maintaining language generation quality and utility metrics such as perplexity and F1 score.
  • REPO uses pairwise supervision: benign and toxic continuations of the same prompt are paired to train a discriminator, aligning the representations of toxic outputs with those of their non-toxic counterparts at the token level.
  • Detailed ablation studies confirm the effectiveness of token-level intervention, showing that REPO's design enables deeper, localized edits in the model's architecture, providing a robust baseline for future work in model safety.
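The core idea behind the key points above, pulling hidden states of toxic continuation tokens toward those of paired benign continuations, can be sketched as a simple alignment loss. This is an illustrative approximation, not the paper's actual objective: the function name `repo_style_loss`, the squared-distance formulation, and the masking scheme are all assumptions for the sake of the example.

```python
import numpy as np

def repo_style_loss(h_toxic, h_benign, mask=None, alpha=1.0):
    """Hypothetical sketch of a token-level representation-erasure loss.

    Pulls hidden states of toxic continuation tokens toward their benign
    counterparts; the benign side is treated as a fixed target so only the
    toxic representations are pushed to move.

    h_toxic, h_benign: (batch, seq_len, hidden_dim) hidden-state arrays.
    mask: optional (batch, seq_len) 0/1 array marking continuation tokens,
          so the prompt tokens are left untouched.
    """
    # Mean squared distance per token between paired representations.
    per_token = ((h_toxic - h_benign) ** 2).mean(axis=-1)
    if mask is not None:
        # Average only over the masked (continuation) positions.
        return alpha * (per_token * mask).sum() / max(mask.sum(), 1)
    return alpha * per_token.mean()
```

In a real training setup this term would be combined with a preference-optimization objective (DPO-style) and a utility-preservation term; the token-level mask is what distinguishes this granular intervention from sequence-level representation edits.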

💡 Why This Paper Matters

This paper presents a groundbreaking approach to detoxifying large language models through REPO, which effectively erases harmful representations while preserving the model's overall capabilities. By directly targeting internal model representations, this method enhances the resilience of LLMs to adversarial attacks, thereby supporting the development of safer AI systems. With its mechanistic insights into model behavior and robust evaluations, this research provides crucial foundational work for advancing AI safety measures in real-world applications.

🎯 Why It's Interesting for AI Security Researchers

As AI systems become more integrated into society, ensuring their safe and responsible deployment is crucial. This paper is of particular interest to AI security researchers because it addresses the pressing issue of toxic content generation in large language models and introduces a novel, robust method to combat this risk. By providing a detailed framework for understanding and mitigating the internal representations that lead to harmful outputs, this research paves the way for developing safer AI technologies and informs best practices for their deployment.

📚 Read the Full Paper