When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment

Authors: Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi

Published: 2025-06-09

arXiv ID: 2506.07452v1

Added to Library: 2025-06-10 04:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in jailbreak queries. Although these style patterns are semantically unrelated to the malicious intents behind jailbreak queries, their safety impact remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We evaluate 32 LLMs across seven jailbreak benchmarks, and find that malicious queries with style patterns inflate the attack success rate (ASR) for nearly all models. Notably, ASR inflation correlates with both the length of style patterns and the relative attention an LLM exhibits on them. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs and five fine-tuning style settings, SafeStyle consistently outperforms baselines in maintaining LLM safety.
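
To make the central measurement concrete, the sketch below shows one way ASR inflation could be estimated: run the same jailbreak queries with and without a style pattern and compare attack success rates. This is a minimal illustration, not the paper's evaluation code; add_list_style, generate, and is_harmful are hypothetical stand-ins for a style wrapper, the target model's generation call, and a safety judge.

```python
# Minimal sketch (not the paper's code): estimate ASR inflation by comparing
# plain jailbreak queries against the same queries wrapped in a style pattern.

def add_list_style(query: str) -> str:
    """Wrap a malicious query in a list-formatting style pattern (one example style)."""
    return f"{query}\nFormat your answer as a numbered list of steps."

def attack_success_rate(queries, generate, is_harmful) -> float:
    """Fraction of queries whose responses are judged harmful."""
    hits = sum(is_harmful(generate(q)) for q in queries)
    return hits / len(queries)

def asr_inflation(queries, generate, is_harmful) -> float:
    """ASR with the style pattern minus ASR without it."""
    plain = attack_success_rate(queries, generate, is_harmful)
    styled = attack_success_rate(
        [add_list_style(q) for q in queries], generate, is_harmful
    )
    return styled - plain
```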

🔍 Key Points

  • Identification of ASR inflation caused by style patterns in jailbreak queries: across seven benchmarks, nearly all of the 32 evaluated models become more vulnerable to malicious queries that include style patterns.
  • Analysis of why style patterns raise safety risks, showing that ASR inflation correlates with both the length of a style pattern and the relative attention the model places on it, and that superficial style alignment (fine-tuning on a specific style) increases vulnerability to jailbreaks in that same style.
  • Development of the SafeStyle defense strategy, which adds a small amount of safety training data matched to the style patterns in the fine-tuning data, mitigating the added safety risk without sacrificing model utility (an illustrative sketch follows this list).
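
The abstract only outlines how SafeStyle works, so the fragment below is a hypothetical illustration of the core idea: mix in safety examples whose style distribution mirrors the style patterns found in the fine-tuning data. The style detector (detect_style), styling function (apply_style), and 10% mixing ratio are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch of the SafeStyle idea as described in the abstract:
# augment fine-tuning data with a small set of safety examples whose styles
# follow the empirical style distribution of the fine-tuning data.
import random
from collections import Counter

def style_distribution(finetune_data, detect_style):
    """Empirical distribution of style labels (e.g., 'list', 'json', 'plain')."""
    counts = Counter(detect_style(ex["response"]) for ex in finetune_data)
    total = sum(counts.values())
    return {style: n / total for style, n in counts.items()}

def safestyle_augment(finetune_data, safety_data, detect_style, apply_style,
                      ratio=0.1, seed=0):
    """Return fine-tuning data mixed with style-matched safety examples."""
    rng = random.Random(seed)
    dist = style_distribution(finetune_data, detect_style)
    n_safety = int(ratio * len(finetune_data))
    styles = rng.choices(list(dist), weights=list(dist.values()), k=n_safety)
    styled_safety = [
        {"prompt": ex["prompt"], "response": apply_style(ex["response"], style)}
        for ex, style in zip(rng.sample(safety_data, n_safety), styles)
    ]
    return finetune_data + styled_safety
```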

💡 Why This Paper Matters

This paper addresses a critical vulnerability in large language models (LLMs): style patterns in input queries, although semantically unrelated to malicious intent, can increase susceptibility to jailbreak attacks. By linking superficial style alignment to safety risks and proposing an effective mitigation, the work makes a concrete contribution to the safety and robustness of LLMs.

🎯 Why It's Interesting for AI Security Researchers

These findings are directly relevant to AI security researchers developing defenses against adversarial prompts and working on the safe deployment of AI systems. The identification of ASR inflation from style patterns, and its amplification by superficial style alignment, exposes a gap in current alignment practice and points toward more resilient fine-tuning and evaluation protocols.

📚 Read the Full Paper