
Preventing Shortcuts in Adapter Training via Providing the Shortcuts

Authors: Anujraaj Argo Goyal, Guocheng Gordon Qian, Huseyin Coskun, Aarush Gupta, Himmy Tam, Daniil Ostashev, Ju Hu, Dhritiman Sagar, Sergey Tulyakov, Kfir Aberman, Kuan-Chieh Jackson Wang

Published: 2025-10-23

arXiv ID: 2510.20887v1

Added to Library: 2025-11-14 23:08 UTC

📄 Abstract

Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model's ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.
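
The mechanism described in the abstract can be pictured as a two-branch training loop: the confounding factors are fed directly into a disposable auxiliary branch (ControlNet- or LoRA-style), so the reconstruction loss can be satisfied without the adapter absorbing them, and the branch is dropped at inference. The sketch below is a minimal, hypothetical PyTorch illustration of that routing idea, not the authors' implementation; the module names (`ToyGenerator`, `IdentityAdapter`, `ConfounderBranch`) and the random-feature stand-ins are assumptions made for brevity.

```python
# Hypothetical sketch of shortcut-rerouted adapter training (toy setup,
# not the paper's code). Confounders are routed through an auxiliary
# branch during training and the branch is removed at inference.
import torch
import torch.nn as nn


class ToyGenerator(nn.Module):
    """Stand-in for a frozen foundation image generator."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)

    def forward(self, noise, cond):
        # cond carries the sum of all conditioning signals.
        return self.backbone(noise + cond)


class IdentityAdapter(nn.Module):
    """Adapter meant to capture ONLY the target attribute (e.g. identity)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, ref_identity_feat):
        return self.proj(ref_identity_feat)


class ConfounderBranch(nn.Module):
    """Auxiliary (ControlNet/LoRA-like) branch that is handed the confounders
    (pose, expression, lighting) directly, so the adapter has no incentive
    to encode them. It is discarded at inference."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, confounder_feat):
        return self.proj(confounder_feat)


dim = 64
generator = ToyGenerator(dim)
adapter = IdentityAdapter(dim)
aux = ConfounderBranch(dim)

# Freeze the foundation model; train only the adapter and the auxiliary branch.
for p in generator.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(adapter.parameters()) + list(aux.parameters()), lr=1e-4)

for step in range(100):
    # Toy "features": in practice these would come from an identity encoder
    # and pose/expression/lighting estimators applied to the training image.
    ref_identity = torch.randn(8, dim)   # target attribute signal
    confounders = torch.randn(8, dim)    # incidental-factor signal
    target = torch.randn(8, dim)         # reconstruction target (the image itself)
    noise = torch.randn(8, dim)

    # Training: both branches condition the generator, so the reconstruction
    # objective can be met without the adapter internalizing the confounders.
    cond = adapter(ref_identity) + aux(confounders)
    loss = nn.functional.mse_loss(generator(noise, cond), target)

    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: the auxiliary branch is removed; only the adapter conditions the
# generator, ideally carrying identity but not pose, expression, or lighting.
with torch.no_grad():
    sample = generator(torch.randn(1, dim), adapter(torch.randn(1, dim)))
```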

🔍 Key Points

  • Identifies a spurious-correlation problem in adapter-based personalization: single-image reconstruction objectives entangle the target attribute (e.g., subject identity) with incidental factors such as pose, expression, and lighting.
  • Introduces Shortcut-Rerouted Adapter Training, which deliberately provides those shortcuts during training by routing the confounding factors through auxiliary modules such as ControlNet or LoRA, removing the adapter's incentive to internalize them.
  • The auxiliary modules are discarded at inference, leaving an adapter that carries the target attribute without the incidental factors.
  • Demonstrated on facial and full-body identity injection, where the approach improves generation quality, diversity, and prompt adherence.

💡 Why This Paper Matters

Adapter-based training is a key mechanism for extending powerful foundation image generators to personalized and stylized text-to-image synthesis, yet entanglement between the target attribute and incidental factors has limited generalization and prompt adherence. This paper offers a simple, broadly applicable remedy that requires no changes to the foundation model: reroute the unwanted factors through disposable auxiliary modules during training. Beyond the concrete gains on identity injection, it articulates a general design principle for the era of large models: when seeking disentangled representations, establish shortcuts for what should not be learned.

🎯 Why It's Interesting for AI Security Researchers

Shortcut learning and spurious correlations are long-standing concerns for researchers who care about the robustness and controllability of large models. This paper offers a counterintuitive mitigation: rather than trying to suppress shortcuts after the fact, it deliberately supplies them through auxiliary modules that are removed at inference, so the trained adapter never absorbs them. The rerouting principle, together with evidence that it improves prompt adherence and diversity without retraining the foundation model, is a transferable design idea for anyone seeking to control what adapters and other fine-tuned modules internalize during training.

📚 Read the Full Paper