Directional Embedding Smoothing for Robust Vision Language Models

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino

Published: 2026-03-16

arXiv ID: 2603.15259v1

Added to Library: 2026-03-17 04:01 UTC

Red Teaming

πŸ“„ Abstract

The safety and reliability of vision-language models (VLMs) are crucial to deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment and yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing the attack success rate across this diverse corpus of attacks, particularly when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems as a lightweight, inference-time defense layer within an overall security framework.
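To make the abstract's notion of directional embedding noise concrete, here is a minimal NumPy sketch, under the assumption that "aligned with the original token embedding vectors" means each perturbation is a random scalar multiple of the token's own embedding. The paper's exact noise distribution and scale may differ, and all names here (directional_smoothing, sigma) are illustrative:

```python
import numpy as np

def directional_smoothing(embeddings: np.ndarray, sigma: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Perturb each token embedding along its own direction.

    Isotropic smoothing would add noise drawn over all directions;
    here the noise is a random scalar rescaling of the embedding
    itself, so each perturbed vector stays on its token's original
    ray and keeps its semantic direction.
    """
    # One scalar per token: embedding e becomes (1 + z) * e.
    scale = rng.normal(loc=0.0, scale=sigma, size=(embeddings.shape[0], 1))
    return embeddings + scale * embeddings

# Example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
noisy = directional_smoothing(emb, sigma=0.1, rng=rng)
```

Keeping the perturbation parallel to each embedding, rather than isotropic, echoes the paper's point that directionality helps preserve the semantic content of the input while still randomizing it.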

πŸ” Key Points

  • Extension of the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to vision-language models (VLMs), using a directional variant of the embedding noise (a toy sketch of the smoothing-plus-aggregation loop follows this list).
  • Demonstrated an effective reduction in the success rate of jailbreaking attacks, particularly when the injected noise is aligned with the original token embedding vectors.
  • Conducted extensive experiments on the JailBreakV-28K benchmark to validate RESTA on VLMs such as LLaVA and Gemma, characterizing the trade-off between security and utility.
  • Highlighted the limitations of current defenses against adaptive attacks, motivating a multi-layer defense strategy in practical deployments.
  • Discussed the theoretical implications of RESTA's effectiveness and avenues for future research into alternative noise variants and defenses against adaptive attacks.
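As a rough picture of how the smoothing and token-aggregation pieces could fit together at inference time, the toy loop below runs several independently perturbed copies of the prompt embeddings through a model step and majority-votes the predicted tokens. Here embed_fn, predict_fn, and the voting rule are placeholder assumptions for illustration, not RESTA's actual aggregation procedure, which is specified in the paper:

```python
from collections import Counter
import numpy as np

def smoothed_next_token(embed_fn, predict_fn, num_samples: int,
                        sigma: float, seed: int = 0) -> int:
    """Aggregate next-token predictions over smoothed input copies.

    embed_fn:   returns the (num_tokens, dim) prompt embeddings.
    predict_fn: maps (noisy) embeddings to a predicted token id.
    Both stand in for a real VLM forward pass.
    """
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(num_samples):
        emb = embed_fn()
        scale = rng.normal(0.0, sigma, size=(emb.shape[0], 1))
        votes.append(predict_fn(emb + scale * emb))  # directional noise
    # Majority vote over the sampled predictions.
    return Counter(votes).most_common(1)[0][0]

# Toy stand-ins for a real model: a fixed prompt, and a "predictor"
# that returns the id of the token with the largest mean embedding.
rng = np.random.default_rng(1)
prompt = rng.standard_normal((4, 8))
token_id = smoothed_next_token(lambda: prompt,
                               lambda e: int(np.argmax(e.mean(axis=1))),
                               num_samples=11, sigma=0.1)
```

An odd num_samples avoids ties in the majority vote; in practice the number of samples is the main knob trading inference cost against robustness.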

πŸ’‘ Why This Paper Matters

This paper represents a significant step in enhancing the security of vision-language models through the RESTA method. It highlights the role of embedding directionality in preserving semantic integrity while providing a robust defense against jailbreaking attacks. By presenting empirical results and safety-utility trade-offs, the findings contribute to the foundation needed for developing trustworthy AI systems built on safety-aligned models.

🎯 Why It's Interesting for AI Security Researchers

The research is highly relevant to AI security researchers because it addresses a critical vulnerability in vision-language models, an area growing in prominence with the rise of agentic AI systems. The proposed RESTA defense not only introduces a novel approach to mitigating jailbreaking risks but also encourages further exploration of adversarial robustness. This aligns with ongoing efforts in the AI community to establish secure frameworks for AI deployment, making it essential reading for those focused on the security and ethical implications of AI technologies.

πŸ“š Read the Full Paper