
Relationship-Aware Safety Unlearning for Multimodal LLMs

Authors: Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo

Published: 2026-03-15

arXiv ID: 2603.14185v2

Added to Library: 2026-03-18 03:01 UTC

Safety

📄 Abstract

Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
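To make the tuple-level granularity in the abstract concrete, here is a minimal Python sketch of unsafe O-R-O tuples. The class name, the toy unsafe set, and the `is_unsafe` helper are our own illustrations, not the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OROTuple:
    subject: str   # e.g., "child"
    relation: str  # e.g., "drinking"
    obj: str       # e.g., "wine"

# Toy unsafe set: the *tuple* is unsafe, not its parts.
UNSAFE_TUPLES = {
    OROTuple("child", "drinking", "wine"),
}

def is_unsafe(subject: str, relation: str, obj: str) -> bool:
    """Flag only the full relational combination, leaving each
    object and relation benign in other contexts."""
    return OROTuple(subject, relation, obj) in UNSAFE_TUPLES

assert is_unsafe("child", "drinking", "wine")          # targeted tuple
assert not is_unsafe("adult", "drinking", "wine")      # safe neighboring relation
assert not is_unsafe("child", "drinking", "juice")     # object marginal preserved
```

The point of the tuple granularity is that concept-level erasure of "child" or "wine" would break both benign cases above, which is exactly the collateral damage the paper aims to avoid.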

🔍 Key Points

  • Introduces relationship-aware safety unlearning, which targets unsafe object-relation-object (O-R-O) tuples in multimodal LLMs; existing unlearning techniques largely ignore such relational context.
  • Designs a graph representation of object-relation-object links that supports systematic unlearning while preserving safe concepts, minimizing collateral damage to benign uses of the same objects and relations.
  • Applies targeted parameter-efficient edits via Low-Rank Adaptation (LoRA) to the pre-trained model, demonstrating selective unlearning with little loss of model utility (see the sketch after this list).
  • Evaluates robustness against paraphrase, contextual, and out-of-distribution image attacks, showing that the method preserves utility under each.
  • Ablation studies confirm that the multi-objective loss, combining consistency and adversarial terms, is crucial for balancing effective unlearning against utility retention (also sketched below).
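The following PyTorch sketch illustrates how the LoRA edits and the multi-objective loss could fit together. It is our own reconstruction under stated assumptions: the LoRA hyperparameters, the loss weights `lambda_c`/`lambda_a`, and the CLIP-style `encode(batch) -> (image_emb, text_emb)` interface are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x). Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)


def safety_unlearning_loss(encode, encode_ref, unsafe, retain, paraphrased,
                           lambda_c: float = 1.0, lambda_a: float = 0.5):
    """Multi-objective loss sketch: suppress unsafe tuples, stay consistent
    with the frozen reference on safe neighbors, and cover paraphrase attacks.

    `encode(batch)` -> (image_emb, text_emb) from the LoRA-edited model;
    `encode_ref` is the same interface on the frozen original (assumed API).
    """
    # (i) Forget: drive image-text alignment for unsafe O-R-O tuples to zero.
    img_u, txt_u = encode(unsafe)
    l_forget = F.cosine_similarity(img_u, txt_u).clamp(min=0).mean()

    # (ii) Consistency: match the frozen model on safe neighboring tuples
    #      (e.g., adult-drinking-wine) to limit collateral damage.
    img_r, txt_r = encode(retain)
    with torch.no_grad():
        img_r0, txt_r0 = encode_ref(retain)
    l_consist = (F.cosine_similarity(img_r, txt_r)
                 - F.cosine_similarity(img_r0, txt_r0)).pow(2).mean()

    # (iii) Adversarial: apply the forget objective to paraphrased prompts too.
    img_p, txt_p = encode(paraphrased)
    l_adv = F.cosine_similarity(img_p, txt_p).clamp(min=0).mean()

    return l_forget + lambda_c * l_consist + lambda_a * l_adv
```

Wrapping only a few projection layers with `LoRALinear` keeps the edit parameter-efficient, and the zero-initialized `B` means the adapter starts as an exact no-op, so any behavioral change is driven entirely by the loss terms above.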

💡 Why This Paper Matters

This paper advances multimodal safety by proposing a framework for unlearning unsafe relational associations rather than whole concepts. The method removes harmful relational content while preserving the model's overall performance and utility, which matters for responsible development and deployment of AI systems in settings where safety and ethical considerations are paramount.

🎯 Why It's Interesting for AI Security Researchers

This work is directly relevant to AI security researchers because it tackles safety failures that arise from relations between otherwise benign concepts, a failure mode that concept-level erasure misses. Focusing on relational context yields a more precise form of unlearning, and the robustness evaluations against paraphrase, contextual, and out-of-distribution attacks offer a useful template for assessing the resilience of safety edits in future work.
