UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Authors: Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale

Published: 2025-10-14

arXiv ID: 2510.12789v1

Added to Library: 2025-11-14 23:11 UTC

📄 Abstract

Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting their accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines alignment of the conditioning distribution with the VLM's reasoning capabilities for greater capability and flexibility at inference. In addition, fine-tuning on editing tasks not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits strong generalization: our model, trained only on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.
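The core LAP idea, pooling each token's hidden states across the layers of a frozen VLM into a single conditioning vector, can be sketched roughly as follows. This is a minimal PyTorch illustration under stated assumptions; the class name, projection layout, and scaling are hypothetical, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LayerwiseAttentionPool(nn.Module):
    """Hypothetical sketch of Layerwise Attention Pooling (LAP):
    for each token, a learned query attends over that token's hidden
    states across all VLM layers, pooling them into one conditioning
    vector for the diffusion model. Details are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of L tensors, each [batch, tokens, dim]
        h = torch.stack(hidden_states, dim=2)               # [B, T, L, D]
        k = self.key(h)
        scores = (k * self.query).sum(-1) / h.shape[-1] ** 0.5  # [B, T, L]
        weights = scores.softmax(dim=-1)                    # attention over layers
        v = self.value(h)
        return (weights.unsqueeze(-1) * v).sum(dim=2)       # [B, T, D]

# Toy usage: 4 frozen-VLM layers, batch 2, 5 tokens, hidden dim 8.
layers = [torch.randn(2, 5, 8) for _ in range(4)]
pooled = LayerwiseAttentionPool(8)(layers)
print(pooled.shape)  # torch.Size([2, 5, 8])
```

Because the attention runs over the layer axis rather than the token axis, the output keeps the original token sequence length, so it can condition the DiT exactly like single-layer features while mixing in both early (low-level) and late (high-level) representations.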

🔍 Key Points

  • UniFusion conditions a diffusion transformer on a single frozen vision-language model (VLM) that acts as a unified encoder for both text and images, removing the need for separate modality-specific encoders.
  • Layerwise Attention Pooling (LAP) extracts both high-level semantics and low-level details from the VLM's text and visual tokens, outperforming other shallow fusion architectures on text-image alignment and on faithful transfer of visual information for editing.
  • VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI) conditions the DiT only on text tokens generated by the VLM during in-model prompt rewriting, aligning the conditioning distribution with the VLM's reasoning for added flexibility at inference.
  • Fine-tuning on editing tasks also improves text-image alignment for generation, evidence of cross-modality knowledge transfer through the unified encoder.
  • Trained only on single-image editing, the model generalizes zero-shot to multiple image references, further motivating the unified encoder design.

💡 Why This Paper Matters

This paper shows that a single frozen VLM can replace the separate text and image encoders that most diffusion pipelines still rely on, without jointly training a large unified model for text and image generation. By avoiding that costly joint training while still enabling cross-modal reasoning and knowledge transfer, UniFusion makes unified-encoder generation substantially more accessible, and its zero-shot generalization from single-image to multi-image editing suggests the design scales beyond its training setup.

🎯 Why It's Interesting for Generative AI Researchers

Researchers working on image generation and editing will find this paper relevant because it tackles a persistent architectural gap: distinct encoders for images and text limit cross-modal reasoning in diffusion models. The LAP mechanism for extracting multi-layer features from a frozen VLM, the VERIFI approach to in-model prompt rewriting, and the evidence of cross-modality knowledge transfer from editing fine-tuning all speak to ongoing questions about how best to condition diffusion transformers. The zero-shot generalization to multiple reference images is a particularly strong signal in favor of unified encoder designs.
