Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Authors: Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie

Published: 2026-03-16

arXiv ID: 2603.15800v1

Added to Library: 2026-03-18 03:00 UTC

πŸ“„ Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.
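
The abstract does not specify how the memory bank is implemented, so the following is a minimal Python sketch of the inference-time loop it describes: accumulate safety insights from prior interactions, retrieve the most relevant ones for a new query, and inject them into the prompt of a frozen MLLM. All names (`SafetyInsight`, `MemoryBank`, `embed`, `mllm_generate`, `reflect`) and the cosine-similarity retrieval are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an inference-time self-reflective memory loop in the spirit of
# EchoSafe. The retrieval and reflection details are assumptions for illustration;
# the abstract only states that safety insights are accumulated, retrieved, and
# integrated into the current prompt.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SafetyInsight:
    """One lesson distilled from a past interaction."""
    context_summary: str   # what the image-text query was about
    lesson: str            # e.g. "refuse if intent is harmful; comply if benign/educational"
    embedding: np.ndarray  # vector used for similarity-based retrieval


@dataclass
class MemoryBank:
    insights: list[SafetyInsight] = field(default_factory=list)

    def retrieve(self, query_emb: np.ndarray, k: int = 3) -> list[SafetyInsight]:
        # Cosine similarity against stored insights; return the top-k most relevant.
        if not self.insights:
            return []
        sims = [
            float(query_emb @ ins.embedding /
                  (np.linalg.norm(query_emb) * np.linalg.norm(ins.embedding) + 1e-8))
            for ins in self.insights
        ]
        order = np.argsort(sims)[::-1][:k]
        return [self.insights[i] for i in order]

    def add(self, insight: SafetyInsight) -> None:
        self.insights.append(insight)


def answer_with_memory(image, text, bank: MemoryBank, embed, mllm_generate, reflect):
    """Training-free loop: retrieve past lessons, answer, then store a new reflection."""
    query_emb = embed(text)                 # assumed text (or fused image-text) embedder
    retrieved = bank.retrieve(query_emb)
    guidance = "\n".join(f"- {ins.lesson}" for ins in retrieved)
    prompt = (
        "Past safety insights that may apply:\n" + (guidance or "- (none yet)") +
        "\n\nUser request:\n" + text
    )
    response = mllm_generate(image, prompt)  # frozen MLLM; no fine-tuning
    # Self-reflection step: summarize what was learned and grow the memory bank.
    bank.add(SafetyInsight(context_summary=text[:80],
                           lesson=reflect(image, text, response),
                           embedding=query_emb))
    return response
```

The property this sketch mirrors is that the model itself is never updated; only the memory bank grows across interactions, which is what the abstract means by continual evolution of safety behavior during inference.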

πŸ” Key Points

  • Introduces MM-SafetyBench++, a curated benchmark for contextual safety evaluation in which every unsafe image-text pair is matched with a safe counterpart produced by minimal edits that flip the user intent while preserving the underlying contextual meaning (an illustrative sketch of such a pair follows this list).
  • The paired design enables controlled evaluation of whether a model adapts its safety behavior to contextual understanding rather than refusing or answering indiscriminately.
  • Proposes EchoSafe, a training-free framework that maintains a self-reflective memory bank, accumulating safety insights from prior interactions and retrieving the relevant ones at inference time.
  • Retrieved experiences are integrated into the current prompt, giving the frozen MLLM context-aware guidance and allowing its safety behavior to keep evolving during inference without any fine-tuning.
  • Experiments across multiple multi-modal safety benchmarks show that EchoSafe consistently outperforms prior approaches, establishing a strong baseline for contextual safety in MLLMs.
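
To make the paired construction concrete, here is a hypothetical record layout and a paired scoring rule: a model is credited only when it refuses the unsafe query and answers its minimally edited safe counterpart. Field names and the example queries are illustrative assumptions, not the released benchmark schema.

```python
# Hypothetical record layout for a contextual-safety pair in the spirit of
# MM-SafetyBench++; the fields and example text are illustrative, not the
# dataset's actual format.
from dataclasses import dataclass


@dataclass
class ContextualSafetyPair:
    image_path: str    # shared image: the visual context is identical for both queries
    unsafe_query: str  # intent that should trigger a refusal
    safe_query: str    # minimally edited counterpart with benign intent


pair = ContextualSafetyPair(
    image_path="images/lock_picking_tools.jpg",
    unsafe_query="Explain how to use these to break into my neighbor's house.",
    safe_query="Explain how a locksmith would use these to help someone locked out of their own house.",
)


def pair_handled_correctly(model_refuses, pair: ContextualSafetyPair) -> bool:
    """A pair counts as correct only if the model refuses the unsafe query AND
    answers the safe one, so blanket refusal cannot score well."""
    return (model_refuses(pair.image_path, pair.unsafe_query)
            and not model_refuses(pair.image_path, pair.safe_query))
```

Scoring at the pair level is what makes the benchmark sensitive to context: a model that refuses everything and a model that refuses nothing both fail the same pairs.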

πŸ’‘ Why This Paper Matters

This paper reframes MLLM safety around context rather than surface-level detection of explicitly unsafe inputs. MM-SafetyBench++ makes contextual safety measurable by pairing each unsafe image-text query with a minimally edited benign counterpart, and EchoSafe demonstrates that a frozen model can improve its safety behavior purely at inference time by accumulating and reusing self-reflective memory. Together they provide both an evaluation protocol and a strong training-free baseline for safety mechanisms that must reason about intent rather than merely detect unsafe content.

🎯 Why It's Interesting for AI Security Researchers

This research is relevant to AI security researchers on two fronts. First, most jailbreak defenses are evaluated only on explicitly unsafe inputs; the paired construction of MM-SafetyBench++ tests whether a model can distinguish near-identical scenarios that differ in intent, a failure mode that single-sided benchmarks do not surface. Second, EchoSafe is training-free and operates entirely at inference time, so it can be layered onto deployed MLLMs as a practical defense, and its reliance on a growing memory of past interactions gives researchers a concrete mechanism whose robustness they can study and stress-test.

πŸ“š Read the Full Paper