
The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Authors: Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi

Published: 2026-01-30

arXiv ID: 2602.02557v1

Added to Library: 2026-02-04 03:05 UTC

Red Teaming

📄 Abstract

Recent advances in end-to-end trained omni-models have significantly improved multimodal understanding. At the same time, safety red-teaming has expanded beyond text to encompass audio-based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross-modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross-modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models. Our results show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks, establishing them as simple yet powerful baselines for future audio red-teaming. We further demonstrate strong cross-model transferability and show that text-transferred audio attacks remain effective even under a stricter audio-only access threat model.
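The attack the abstract describes, converting a mature textual jailbreak into speech and submitting it through the model's audio input, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `synthesize_speech` and `query_omni_model` are hypothetical placeholders for a real text-to-speech engine and a real omni-model API.

```python
def synthesize_speech(text: str) -> bytes:
    """Hypothetical TTS stand-in: in practice this would call a
    text-to-speech engine and return an audio waveform."""
    return text.encode("utf-8")  # placeholder "waveform"

def query_omni_model(audio: bytes) -> str:
    """Hypothetical omni-model call: in practice this would submit
    the audio clip to a multimodal model's audio input channel."""
    return f"<model response to {len(audio)}-byte audio clip>"

def text_transferred_audio_attack(text_jailbreak: str) -> str:
    """Transfer a textual jailbreak to the audio modality:
    1. synthesize the adversarial text as speech,
    2. submit the resulting audio to the omni-model.
    Per the paper's thesis, strong text-audio alignment means the
    model may process the spoken prompt much like its textual
    counterpart, carrying the vulnerability across modalities."""
    audio = synthesize_speech(text_jailbreak)
    return query_omni_model(audio)

print(text_transferred_audio_attack("adversarial prompt here"))
```

The point of the sketch is that the attacker needs no audio-specific optimization: the adversarial effort lives entirely in the (already mature) textual prompt, and the audio step is a plain modality conversion.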

🔍 Key Points

  • The paper introduces the concept of the 'alignment curse,' where vulnerabilities from text modalities can inadvertently transfer to audio modalities due to strong multimodal alignment in omni-models.
  • Empirical evidence shows that text-transferred audio jailbreaks are as effective as, and often more effective than, existing audio-based jailbreak methods, positioning them as simple yet strong baselines for future research.
  • The authors demonstrate that text-transferred audio attacks maintain effectiveness even in stricter audio-only access scenarios, revealing significant implications for audio red-teaming methodologies.
  • The authors analyze in detail the conditions under which cross-modality transfer succeeds, linking it to representation-level alignment and offering a mathematical framework for understanding the phenomenon.
  • The work highlights the urgent need for the development of robust defenses against cross-modality vulnerabilities in omni-models, as these systems become more widely deployed.
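The representation-level alignment mentioned above is commonly quantified as the similarity between a model's text and audio embeddings of the same content. A minimal sketch, assuming paired embedding vectors are already extracted (the toy vectors below are made-up values, not data from the paper):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for one prompt's text-modality and audio-modality
# embeddings; in a real study these would come from the omni-model's
# shared representation space.
text_emb = [0.9, 0.1, 0.4]
audio_emb = [0.8, 0.2, 0.5]

alignment = cosine_similarity(text_emb, audio_emb)
print(f"text-audio alignment: {alignment:.3f}")
```

Under the alignment-curse view, the closer this score is to 1 for semantically matched text and audio inputs, the more likely a jailbreak that works in text will carry over to speech.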

💡 Why This Paper Matters

This paper matters for AI safety because it uncovers inherent vulnerabilities in multimodal systems, specifically how jailbreak attacks can transfer between modalities. Understanding the alignment curse is crucial for developers and researchers working on the safety and robustness of advanced AI systems, and it underscores that safety mechanisms must evolve alongside the models themselves.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this work significant because it opens new avenues for probing vulnerabilities in audio- and text-driven models. The alignment-curse framing and the empirical findings on jailbreak effectiveness can inform the design of stronger defenses and help ensure that future models resist adversarial attacks across input modalities.

📚 Read the Full Paper