
Randomized Smoothing Meets Vision-Language Models

Authors: Emmanouil Seferis, Changshun Wu, Stefanos Kollias, Saddek Bensalem, Chih-Hong Cheng

Published: 2025-09-19

arXiv ID: 2509.16088v1

Added to Library: 2025-09-22 04:00 UTC

Red Teaming

📄 Abstract

Randomized smoothing (RS) is one of the prominent techniques for certifying the robustness of machine learning models, where point-wise robustness certificates can be derived analytically. While RS is well understood for classification, its application to generative models is unclear, since their outputs are sequences rather than labels. We resolve this by connecting generative outputs to an oracle classification task and showing that RS can still be enabled: the final response can be classified as a discrete action (e.g., service-robot commands in VLAs), as harmful vs. harmless (content moderation or toxicity detection in VLMs), or, more generally, an oracle can cluster answers into semantically equivalent ones. Provided that the error rate of the oracle classifier comparison is bounded, we develop the theory that relates the number of samples to the corresponding robustness radius. We further derive improved scaling laws that analytically relate the certified radius and accuracy to the number of samples, showing that the earlier result, that 2 to 3 orders of magnitude fewer samples suffice with minimal loss, remains valid even under weaker assumptions. Together, these advances make robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, as validated against recent jailbreak-style adversarial attacks.
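
The sketch below illustrates the idea described in the abstract: wrap a generative model with Gaussian input noise and let an oracle map each response to a discrete label, so the standard randomized-smoothing machinery applies. It is a minimal sketch following the two-stage certification recipe of Cohen et al. (2019), not the authors' released code; `generate` and `oracle` are caller-supplied, hypothetical stand-ins for the VLM and the oracle classifier, and the paper's refined sample-complexity bounds are not reproduced here.

```python
# Minimal sketch: randomized smoothing over a generative model via an oracle that
# maps each response to a discrete label (harmful vs. harmless, a service-robot
# command, or a semantic-equivalence cluster). `generate` and `oracle` are
# hypothetical caller-supplied callables, not APIs from the paper.

from collections import Counter

import numpy as np
from scipy.stats import binomtest, norm


def sample_labels(image, generate, oracle, sigma, n):
    """Query the model on n Gaussian-noised copies of the image and count oracle labels."""
    counts = Counter()
    for _ in range(n):
        noisy = image + sigma * np.random.randn(*image.shape)
        counts[oracle(generate(noisy))] += 1
    return counts


def certify(image, generate, oracle, sigma=0.25, n0=100, n=1000, alpha=0.001):
    """Return (label, certified L2 radius) for the smoothed decision, or (None, 0.0) to abstain."""
    guess = sample_labels(image, generate, oracle, sigma, n0).most_common(1)[0][0]
    count = sample_labels(image, generate, oracle, sigma, n)[guess]
    # One-sided Clopper-Pearson lower bound (level alpha) on P[oracle label == guess].
    p_lower = binomtest(count, n).proportion_ci(
        confidence_level=1 - 2 * alpha, method="exact"
    ).low
    if p_lower <= 0.5:
        return None, 0.0                      # abstain: no certificate at this confidence
    return guess, sigma * norm.ppf(p_lower)   # certified radius R = sigma * Phi^{-1}(p_lower)
```

The two-stage split (n0 selection samples, then n independent estimation samples) avoids selection bias in the Clopper-Pearson bound; the paper's scaling laws concern how the certified radius and accuracy degrade as n shrinks.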

🔍 Key Points

  • Introduction of a method for applying Randomized Smoothing (RS) to generative models, particularly Vision-Language Models (VLMs), enabling robustness certification for outputs that are not merely categorical labels.
  • Development of theoretical foundations connecting RS with oracle classification tasks, which allow for the certification of generative outputs like harmful/harmless evaluations and discrete actions.
  • Analytical derivation of improved scaling laws relating the certified radius and accuracy to the number of samples, demonstrating substantial reductions in sample complexity while maintaining certification effectiveness (the baseline certificate these laws refine is restated after this list).
  • Empirical validation of the proposed RS extension against various adversarial attacks, including jailbreak-style attacks on state-of-the-art VLMs, confirming the practicality of the approach.
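
For reference, here is a minimal statement of the standard randomized-smoothing certificate (Cohen et al., 2019) on which the oracle-based extension and the scaling laws above build. The notation (f, g, sigma, p_A) is generic rather than the paper's, and the paper's improved sample-complexity results are not reproduced here.

```latex
% Smoothed decision over the oracle-labelled output of a generative model f,
% with Gaussian input noise of standard deviation sigma:
g(x) = \arg\max_{c}\; \Pr_{\varepsilon \sim \mathcal{N}(0,\,\sigma^2 I)}\bigl[\, f(x+\varepsilon) = c \,\bigr]

% If the top label A satisfies Pr[f(x+\varepsilon) = A] \ge p_A > 1/2, then g is
% constant within an L2 ball around x:
\|\delta\|_2 < R = \sigma\, \Phi^{-1}(p_A) \quad\Longrightarrow\quad g(x+\delta) = g(x)

% In practice p_A is estimated from n noisy samples: with k samples receiving label A,
% a one-sided Clopper-Pearson lower bound \underline{p_A} at confidence 1-\alpha replaces
% p_A, so the certified radius \sigma\,\Phi^{-1}(\underline{p_A}) shrinks as n decreases;
% the paper's scaling laws quantify this trade-off between n, radius, and accuracy.
```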

💡 Why This Paper Matters

This paper advances AI safety by making robustness certification both well-defined and practical for generative models, addressing a critical gap in the existing literature. Integrating randomized smoothing into VLMs can improve the reliability of AI systems in real-world deployments, particularly in sensitive settings such as content moderation and service-robot interaction.

🎯 Why It's Interesting for AI Security Researchers

Adversarial robustness for generative AI is a fast-growing area, and this paper is directly relevant to AI security researchers working to protect AI systems against manipulative inputs. Its certification techniques and empirical results against jailbreak-style attacks contribute to building safer AI systems and provide a benchmark for future work on adversarial defenses.
