Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

Published: 2025-08-28

arXiv ID: 2508.20570v1

Added to Library: 2025-08-29 04:00 UTC

📄 Abstract

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
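
The circuit-ablation idea described above is straightforward to prototype. Below is a minimal sketch of head-level ablation in a CLIP vision encoder using Hugging Face `transformers` forward pre-hooks. This is not the authors' released code: the model checkpoint is just a common public one, and the (layer, head) pairs are placeholders standing in for the typographic circuit the paper identifies empirically in the latter half of the encoder's layers.

```python
# Minimal sketch of attention-head ablation in a CLIP vision encoder.
# NOT the paper's released implementation; HEADS_TO_ABLATE is a placeholder.
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical (layer, head) pairs standing in for the typographic circuit.
HEADS_TO_ABLATE = [(8, 3), (9, 7), (10, 1), (11, 5)]

def make_head_ablation_hook(head_idx: int, head_dim: int):
    """Zero one head's slice of the concatenated attention output before it
    enters the output projection, removing that head's contribution to the
    residual stream (and hence to the CLS token)."""
    def hook(module, args):
        hidden = args[0].clone()  # (batch, seq_len, num_heads * head_dim)
        hidden[..., head_idx * head_dim : (head_idx + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

head_dim = model.config.hidden_size // model.config.num_attention_heads
for layer_idx, head_idx in HEADS_TO_ABLATE:
    layer = model.vision_model.encoder.layers[layer_idx]
    layer.self_attn.out_proj.register_forward_pre_hook(
        make_head_ablation_hook(head_idx, head_dim)
    )
```

With hooks like these in place, image features from the patched encoder should be far less sensitive to rendered text, at the cost of text-recognition ability, which is exactly the trade-off the released "dyslexic" models make explicit.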

🔍 Key Points

  • Mechanistic analysis of how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract typographic information and transmit it to the cls token (a localization sketch follows this list).
  • A training-free defense that selectively ablates this typographic circuit of attention heads, with no finetuning required (as sketched after the abstract above).
  • Up to a 19.6% accuracy improvement on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%.
  • Performance competitive with current state-of-the-art typographic defenses that rely on finetuning, despite requiring no training at all.
  • Release of a family of "dyslexic" CLIP models that are significantly more robust to typographic attacks and serve as drop-in replacements for safety-critical applications where the risks of text-based manipulation outweigh the utility of text recognition.
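
To illustrate how such heads might be located in the first place, the sketch below scores every attention head by how much its pre-projection output at the CLS position shifts when attack text is pasted onto an image. This is an illustrative heuristic rather than the paper's exact causal analysis, and the model name, image path, and attack text are all placeholder assumptions.

```python
# Illustrative head-localization heuristic, not the paper's exact procedure:
# heads whose CLS-position output shifts strongly under a pasted-text attack
# are candidates for the typographic circuit (causality is then tested by
# ablating them, as in the sketch above).
import torch
from PIL import Image, ImageDraw
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def add_attack_text(image: Image.Image, text: str = "ostrich") -> Image.Image:
    """Paste attack text onto a copy of the image (a crude typographic attack)."""
    attacked = image.copy()
    ImageDraw.Draw(attacked).text((10, 10), text, fill="white")
    return attacked

captured = {}
def make_capture_hook(layer_idx: int):
    def hook(module, args):
        # args[0]: concatenated per-head attention outputs, pre-projection.
        captured[layer_idx] = args[0].detach()
    return hook

for i, layer in enumerate(model.vision_model.encoder.layers):
    layer.self_attn.out_proj.register_forward_pre_hook(make_capture_hook(i))

def per_head_cls_outputs(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    n_heads = model.config.num_attention_heads
    # CLS token sits at position 0; split its vector into per-head slices.
    return torch.stack(
        [captured[i][0, 0].view(n_heads, -1) for i in sorted(captured)]
    )  # (num_layers, num_heads, head_dim)

clean = Image.open("example.jpg").convert("RGB")  # placeholder image path
shift = (per_head_cls_outputs(add_attack_text(clean))
         - per_head_cls_outputs(clean)).norm(dim=-1)  # (layers, heads)
print(shift)  # large entries suggest candidate typographic heads
```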

💡 Why This Paper Matters

This work moves typographic robustness from empirical patching toward mechanistic understanding. By locating the specific attention heads that extract injected text and transmit it to the cls token, the authors show that a targeted, training-free intervention can largely neutralize typographic attacks at minimal cost to standard accuracy, remaining competitive with defenses that require finetuning. The released dyslexic CLIP models give practitioners an immediate, drop-in mitigation for applications where text-based manipulation is a real threat.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because typographic attacks are a practical threat surface for multi-modal systems, enabling targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks. By tracing these attacks to a small, causally identifiable circuit of attention heads inside the vision encoder, the work demonstrates that mechanistic interpretability can yield concrete, low-cost defenses. Its training-free ablation approach offers a template for hardening deployed CLIP-based systems without retraining, and the dyslexic model family provides ready-made components for safety-critical pipelines.

📚 Read the Full Paper

https://arxiv.org/abs/2508.20570