← Back to Library

SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Authors: Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang

Published: 2026-02-27

arXiv ID: 2602.23956v2

Added to Library: 2026-03-24 03:03 UTC

📄 Abstract

Recent advances in text-to-video diffusion models have enabled high-fidelity, temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose the Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
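The abstract describes steering frame-level cross-attention toward the event prompt relevant to each frame, with an adaptively chosen steering strength. The paper's actual formulation is not reproduced on this page, so the sketch below is only one plausible reading of that idea: an additive cross-attention bias that routes each frame's queries to its event's text tokens, plus a simple binary-search solver (a hypothetical stand-in for ABSS) that picks a strength hitting a target attention mass on the steered tokens. All function names, the `event_spans` format, and the target-mass criterion are assumptions, not the authors' method.

```python
import numpy as np

def event_aligned_bias(num_frames, event_spans, strength=1.0):
    """Build an additive cross-attention bias steering each frame's queries
    toward the text tokens of its assigned event (an EAQS-style sketch).

    event_spans: list of ((frame_start, frame_end), (token_start, token_end))
    pairs, e.g. [((0, 8), (0, 10)), ((8, 16), (10, 20))].
    Returns a (num_frames, num_tokens) array: `strength` where a frame's
    query should attend to its event's tokens, 0 elsewhere.
    """
    num_tokens = max(t_end for (_, (_, t_end)) in event_spans)
    bias = np.zeros((num_frames, num_tokens))
    for (f_start, f_end), (t_start, t_end) in event_spans:
        bias[f_start:f_end, t_start:t_end] = strength
    return bias

def steered_cross_attention(Q, K, V, bias):
    """Scaled dot-product cross-attention with the event bias added to the
    logits before the softmax (frames = queries, text tokens = keys)."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def auto_balance_strength(Q, K, mask, target_mass=0.7, lo=0.0, hi=20.0, iters=30):
    """Hypothetical ABSS-style solver: binary-search a scalar strength so the
    mean attention mass each frame places on its event's tokens reaches
    `target_mass`, trading under-steering (blended events) against
    over-steering (lost fidelity). `mask` is the 0/1 event-alignment mask;
    the steered mass is monotone in strength, so bisection converges."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        logits = Q @ K.T / np.sqrt(Q.shape[-1]) + mid * mask
        logits -= logits.max(axis=-1, keepdims=True)
        w = np.exp(logits)
        w /= w.sum(axis=-1, keepdims=True)
        mass = (w * mask).sum(axis=-1).mean()
        if mass < target_mass:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Usage would be: build the 0/1 mask once from the event schedule, solve for a strength with `auto_balance_strength`, then apply `strength * mask` as the bias in `steered_cross_attention` at each denoising step.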

🔍 Key Points

  • SwitchCraft is a training-free framework for multi-event video generation, requiring no fine-tuning of the underlying text-to-video diffusion model.
  • The key insight: uniform prompt injection across time ignores the correspondence between events and frames, which is why multi-event prompts yield blended or collapsed scenes.
  • Event-Aligned Query Steering (EAQS) steers frame-level attention so that each frame aligns with its relevant event prompt.
  • The Auto-Balance Strength Solver (ABSS) adaptively balances steering strength to preserve temporal consistency and visual fidelity.
  • Experiments show substantial improvements in prompt alignment, event clarity, and scene consistency over existing baselines.

💡 Why This Paper Matters

The paper addresses a practical gap in text-to-video generation: current diffusion models are optimized for single-event prompts, and without explicit temporal grounding they collapse multi-event narratives into blended scenes. By introducing inference-time attention controls (EAQS and ABSS) that require no retraining, SwitchCraft improves prompt alignment and event clarity with a simple, plug-in mechanism, making coherent multi-event narrative generation accessible on existing models.

🎯 Why It's Interesting for AI Security Researchers

For researchers who audit generative systems, SwitchCraft illustrates how much control over a text-to-video model's output can be exerted purely at inference time through attention steering, without any training or weight access. Understanding such training-free control mechanisms is relevant to assessing how prompt-conditioned generation pipelines can be redirected after deployment, and to evaluating the robustness of the assumptions those pipelines make about prompt handling.

📚 Read the Full Paper