← Back to Library

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Authors: Hyunjun Kim

Published: 2026-01-01

arXiv ID: 2601.00454v1

Added to Library: 2026-01-07 10:04 UTC

📄 Abstract

Guardrail models are essential for ensuring the safety of Large Language Model (LLM) deployments, but processing full multi-turn conversation histories incurs significant computational cost. We propose Defensive M2S, a training paradigm that fine-tunes guardrail models on Multi-turn to Single-turn (M2S) compressed conversations rather than complete dialogue histories. We provide a formal complexity analysis showing that M2S reduces training cost from $O(n^2)$ to $O(n)$ for $n$-turn conversations. Empirically, on our training dataset (779 samples, avg. 10.6 turns), M2S requires only 169K tokens compared to 15.7M tokens for the multi-turn baseline -- a 93$\times$ reduction. We evaluate Defensive M2S across three guardrail model families (LlamaGuard, Nemotron, Qwen3Guard) and three compression templates (hyphenize, numberize, pythonize) on SafeDialBench, a comprehensive multi-turn jailbreak benchmark. Our best configuration, Qwen3Guard with hyphenize compression, achieves 93.8% attack detection recall while reducing inference tokens by 94.6% (from 3,231 to 173 tokens per conversation). This represents a 38.9 percentage point improvement over the baseline while dramatically reducing both training and inference costs. Our findings demonstrate that M2S compression can serve as an effective efficiency technique for guardrail deployment, enabling scalable safety screening of long multi-turn conversations.

🔍 Key Points

  • Investigation of emoji-based jailbreaking techniques revealing significant vulnerabilities in LLMs' safety alignment mechanisms.
  • Empirical analysis of 50 emoji-augmented prompts on four different LLMs, providing insights into model-specific weaknesses and varying success rates of jailbreaking attempts.
  • Identification of substantial performance discrepancies across models, with Qwen 2 7B demonstrating 0% success in jailbreak attempts while others had up to 10%.
  • Quantitative evaluation using statistical methods, confirming significant differences in model performance and ethical compliance rates through a chi-square test.
  • Discussion of the implications for model training and deployment, emphasizing the need for improved handling of non-textual prompts in LLM safety protocols.

💡 Why This Paper Matters

This paper highlights critical vulnerabilities in LLMs regarding emoji-based adversarial prompting, exposing limitations in current safety alignment mechanisms. Its findings underscore the necessity for adaptive and robust defenses against emerging threats in AI, particularly in the context of maintaining ethical standards in machine learning applications.

🎯 Why It's Interesting for AI Security Researchers

The study presents crucial insights into the robustness and safety of LLMs, a topic of significant concern in AI security research. By demonstrating how non-textual inputs can bypass existing safety measures, this work can inform the development of more resilient AI systems and contribute to the standardization of evaluation methods for LLMs, making it highly relevant for researchers focused on security and safety in AI.

📚 Read the Full Paper