
TokenAR: Multiple Subject Generation via Autoregressive Token-level Enhancement

Authors: Haiyue Sun, Qingdong He, Jinlong Peng, Peng Tang, Jiangning Zhang, Junwei Zhu, Xiaobin Hu, Shuicheng Yan

Published: 2025-10-18

arXiv ID: 2510.16332v1

Added to Library: 2025-11-14 23:09 UTC

📄 Abstract

Autoregressive (AR) models have shown remarkable success in conditional image generation. However, when generating from multiple reference images, these approaches struggle to decouple the different reference identities. In this work, we propose the TokenAR framework, built around a simple but effective token-level enhancement mechanism that addresses the reference identity confusion problem. The token-level enhancement consists of three parts: 1) Token Index Embedding clusters token indices so that tokens from the same reference image are represented consistently; 2) Instruct Token Injection acts as an extra visual feature container that injects detailed, complementary priors for the reference tokens; 3) the Identity-Token Disentanglement (ITD) strategy explicitly guides the token representations toward independently representing the features of each identity. This token-enhancement framework significantly augments the capabilities of existing AR-based methods in conditional image generation, enabling strong identity consistency while preserving high-quality background reconstruction. Driven by the goal of high quality and high diversity in multi-subject generation, we introduce the InstructAR dataset, the first open-source, large-scale, multi-reference, open-domain image generation dataset: it contains 28K training pairs, each with two reference subjects, a relative prompt, and a background with mask annotation, curated for training and evaluating multiple-reference image generation. Comprehensive experiments validate that our approach surpasses current state-of-the-art models on the multiple-reference image generation task. The implementation code and datasets will be made publicly available at https://github.com/lyrig/TokenAR
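
The three components named in the abstract can be pictured with a short, non-authoritative sketch. The module below is not the authors' implementation; it only illustrates, under assumed names (`ref_ids`, `num_instruct_tokens`, `d_model`), how a per-reference index embedding and a small bank of learned instruct tokens could be added to the token embeddings consumed by an AR decoder.

```python
# Minimal sketch (not the paper's code) of token-level enhancement for an
# AR image decoder operating on quantized image tokens.
import torch
import torch.nn as nn


class TokenLevelEnhancement(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_refs: int = 2,
                 num_instruct_tokens: int = 4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Token Index Embedding: one learned vector per reference identity,
        # shared by every token that belongs to that reference image
        # (index 0 is reserved here for background/target tokens).
        self.index_embed = nn.Embedding(num_refs + 1, d_model)
        # Instruct Token Injection: a small bank of learned tokens acting as
        # an extra container for detailed, complementary visual priors.
        self.instruct_tokens = nn.Parameter(
            torch.randn(num_instruct_tokens, d_model) * 0.02)

    def forward(self, token_ids: torch.Tensor, ref_ids: torch.Tensor) -> torch.Tensor:
        """token_ids: (B, L) quantized image-token ids.
        ref_ids:   (B, L) index of the reference each token came from (0 = target)."""
        x = self.tok_embed(token_ids) + self.index_embed(ref_ids)
        # Prepend the instruct tokens so the AR decoder can attend to them.
        instruct = self.instruct_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([instruct, x], dim=1)
```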

🔍 Key Points

  • Proposes TokenAR, a token-level enhancement framework for autoregressive conditional image generation that targets the reference identity confusion problem in multi-subject generation.
  • Token Index Embedding clusters token indices so that all tokens originating from the same reference image carry a consistent identity signal.
  • Instruct Token Injection adds learned tokens that serve as an extra visual feature container, injecting detailed and complementary priors for the reference tokens.
  • The Identity-Token Disentanglement (ITD) strategy explicitly guides token representations to independently encode the features of each identity; a hypothetical sketch of such a loss follows this list.
  • Introduces the InstructAR dataset: 28K open-domain training pairs, each with two reference subjects, a relative prompt, and a mask-annotated background, for training and evaluating multi-reference image generation.
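
The abstract does not spell out the ITD objective, so the following is only a hypothetical instantiation under assumed inputs (`hidden`, `ref_ids`): an orthogonality-style penalty that pushes the pooled hidden states of the two reference subjects apart so each set of tokens encodes its own identity.

```python
# Hypothetical ITD-style loss sketch; the paper's actual objective may differ.
import torch
import torch.nn.functional as F


def itd_loss(hidden: torch.Tensor, ref_ids: torch.Tensor) -> torch.Tensor:
    """hidden: (B, L, D) decoder hidden states.
    ref_ids: (B, L) with 1 and 2 marking tokens of the two reference subjects."""
    mask1 = (ref_ids == 1).unsqueeze(-1).float()
    mask2 = (ref_ids == 2).unsqueeze(-1).float()
    # Mean-pool each subject's token representations.
    z1 = (hidden * mask1).sum(1) / mask1.sum(1).clamp(min=1.0)
    z2 = (hidden * mask2).sum(1) / mask2.sum(1).clamp(min=1.0)
    # Penalize squared cosine similarity between the two identity embeddings.
    return (F.cosine_similarity(z1, z2, dim=-1) ** 2).mean()
```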

💡 Why This Paper Matters

This paper matters because it tackles a persistent weakness of autoregressive conditional image generation: when several reference subjects are provided, existing methods confuse their identities. TokenAR's token-level enhancement (token index embedding, instruct token injection, and identity-token disentanglement) improves identity consistency while preserving high-quality background reconstruction, and the accompanying InstructAR dataset gives the community an open-source, large-scale resource of 28K two-subject training pairs for training and evaluating multi-reference generation.

🎯 Why It's Interesting for AI Security Researchers

TokenAR is a generative-modeling paper rather than a security paper, but it may still interest AI-security-minded readers: systems that faithfully reproduce multiple reference identities are directly tied to concerns around impersonation and content provenance, and the paper's analysis of how identity information is bound to and disentangled from individual tokens, together with the open InstructAR dataset, provides concrete material for studying attribution and misuse of identity-conditioned autoregressive image generators.

📚 Read the Full Paper