Relationship-Aware Safety Unlearning for Multimodal LLMs

Authors: Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo

Published: 2026-03-15

arXiv ID: 2603.14185v1

Added to Library: 2026-03-17 03:01 UTC

Safety

πŸ“„ Abstract

Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
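
The abstract frames safety as a property of object-relation-object tuples rather than of individual concepts. As a concrete, hedged illustration (not the authors' released code), the Python sketch below shows one way such tuples could be represented and scored against an image with an off-the-shelf CLIP model via Hugging Face transformers; the caption template, example tuples, and checkpoint name are assumptions made for the example.

```python
# A minimal sketch, assuming CLIP via Hugging Face transformers.
# The tuple lists, caption template, and checkpoint are illustrative,
# not taken from the paper.
from dataclasses import dataclass

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


@dataclass(frozen=True)
class OROTuple:
    subject: str   # e.g. "child"
    relation: str  # e.g. "drinking"
    obj: str       # e.g. "wine"

    def caption(self) -> str:
        # Naive verbalization; the paper may use richer templates.
        return f"a photo of a {self.subject} {self.relation} {self.obj}"


UNSAFE_TUPLES = [OROTuple("child", "drinking", "wine")]    # unlearning targets
SAFE_NEIGHBORS = [                                         # must be preserved
    OROTuple("adult", "drinking", "wine"),
    OROTuple("child", "drinking", "juice"),
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def tuple_scores(image: Image.Image, tuples: list[OROTuple]) -> torch.Tensor:
    """Image-text logits between one image and each tuple's caption."""
    inputs = processor(
        text=[t.caption() for t in tuples],
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        return model(**inputs).logits_per_image.squeeze(0)  # (len(tuples),)
```

Scoring both the unsafe tuples and their safe neighbors on the same images is what makes the evaluation relational: a successful edit should drop the former scores while leaving the latter intact.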

πŸ” Key Points

  • A relationship-aware safety unlearning framework for generative multimodal models that addresses relational safety failures by targeting unsafe object-relation-object (O-R-O) tuples rather than isolated concepts.
  • A graph-based representation of unsafe relationships combined with targeted parameter-efficient edits (via LoRA) that suppress unsafe tuples while preserving benign associations.
  • Experiments on CLIP demonstrating large reductions in unsafe-relationship recognition alongside strong utility preservation in safe and neutral scenarios.
  • A multi-objective loss that trades off selective forgetting of harmful associations against retention of useful knowledge (see the sketch after this list), evaluated with metrics covering both forgetting and utility.
  • Future directions: causal tracing, hierarchical graph structures, adversarial data generation for robustness, and scaling the framework to larger generative models.
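
The multi-objective loss in the fourth point can be made concrete. The sketch below is an assumption-laden illustration rather than the paper's implementation: it attaches LoRA adapters to CLIP's attention projections with the peft library and combines a forget term on unsafe tuples with retain and anchor terms on safe neighbors. The term weights, the choice of target modules, and the anchoring to a frozen reference model are all assumptions.

```python
# A minimal sketch of the multi-objective unlearning loss, assuming LoRA
# adapters attached with the `peft` library. Term weights, target modules,
# and the frozen-reference anchor are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"])  # attention projections
model = get_peft_model(model, lora_cfg)  # only LoRA weights stay trainable


def unlearning_loss(
    unsafe_img: torch.Tensor,    # (B, D) image embeddings of unsafe-tuple scenes
    unsafe_txt: torch.Tensor,    # (B, D) embeddings of unsafe O-R-O captions
    safe_img: torch.Tensor,      # (B, D) image embeddings of safe neighbor scenes
    safe_txt: torch.Tensor,      # (B, D) embeddings of safe neighbor captions
    ref_safe_txt: torch.Tensor,  # (B, D) same safe captions from a frozen model
    lambda_retain: float = 1.0,
    lambda_anchor: float = 0.5,
) -> torch.Tensor:
    unsafe_img = F.normalize(unsafe_img, dim=-1)
    safe_img = F.normalize(safe_img, dim=-1)
    # Forget: push unsafe images away from their unsafe-tuple captions.
    forget = F.cosine_similarity(unsafe_img, F.normalize(unsafe_txt, dim=-1)).mean()
    # Retain: keep safe neighboring relations correctly matched.
    retain = -F.cosine_similarity(safe_img, F.normalize(safe_txt, dim=-1)).mean()
    # Anchor: hold safe-caption embeddings near the frozen reference model's,
    # protecting object marginals from collateral drift.
    anchor = F.mse_loss(safe_txt, ref_safe_txt)
    return forget + lambda_retain * retain + lambda_anchor * anchor
```

During unlearning, the embeddings would come from the adapted model's get_image_features / get_text_features, with ref_safe_txt computed once from a frozen copy of the base model; gradients then flow only into the LoRA adapters, which keeps the edit parameter-efficient and easy to revert.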

πŸ’‘ Why This Paper Matters

The paper advances machine unlearning and AI safety by focusing on relational safety in generative multimodal models, a gap in current methods, which often overlook how combinations of individually benign concepts produce unsafe outputs. By unlearning at the level of O-R-O tuples rather than whole concepts, the framework strengthens safety while preserving model usability, an essential property for practical deployment.

🎯 Why It's Interesting for AI Security Researchers

The paper is relevant to AI security researchers because it targets a concrete failure mode of generative models: harmful outputs that arise from relations between otherwise benign concepts. The framework offers a way to mitigate such risks without degrading benign capability, and its robustness results under paraphrase, contextual, and out-of-distribution image attacks can inform future work on selective unlearning and adversarial resilience.

πŸ“š Read the Full Paper: https://arxiv.org/abs/2603.14185v1