
CATCH: A Modular Cross-domain Adaptive Template with Hook

Authors: Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou

Published: 2025-10-30

arXiv ID: 2510.26582v1

Added to Library: 2025-11-14 23:06 UTC

📄 Abstract

Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
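
The abstract does not spell out the hook mechanism, but as a rough illustration of how a Visual Adapter could be injected into a frozen backbone via a forward hook, the PyTorch sketch below attaches a small residual adapter to a vision encoder's output and detaches it when the domain changes. The module names, the adapter design, and the llava_model.vision_tower attribute are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Hypothetical lightweight bottleneck adapter applied to vision features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual adjustment keeps the frozen backbone's features dominant.
        return feats + self.up(self.act(self.down(feats)))


def attach_visual_adapter(vision_encoder: nn.Module, adapter: VisualAdapter):
    """Register a forward hook that replaces the encoder output with adapted features.

    The returned handle lets the adapter be detached (handle.remove()) when the
    input domain changes, mirroring the plug-and-play behaviour the abstract describes.
    """
    def hook(module, inputs, output):
        # Returning a value from a forward hook overrides the module's output.
        return adapter(output)
    return vision_encoder.register_forward_hook(hook)


# Usage sketch (attribute names are assumptions):
# vision_encoder = llava_model.vision_tower
# for p in vision_encoder.parameters():
#     p.requires_grad_(False)
# handle = attach_visual_adapter(vision_encoder, VisualAdapter(dim=1024))
# ...run VQA inference on remote-sensing / medical / chart / math-diagram images...
# handle.remove()  # detach before plugging in another domain's adapter
```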

🔍 Key Points

  • CATCH is a plug-and-play framework for cross-domain VQA adaptation that improves the generalization of models such as LLaVA while requiring minimal changes to the backbone architecture.
  • A lightweight domain classifier identifies the input image type (e.g., remote sensing, medical imaging, charts, or math diagrams) and selects the matching adaptation path.
  • A dual adapter mechanism decouples linguistic and visual adaptation: a Prompt Adapter modulates the language side while a Visual Adapter adjusts vision features (see the routing sketch after this list).
  • Both modules are injected dynamically through a unified hook interface, so no retraining or per-domain fine-tuning of the backbone model is needed.
  • Across four domain-specific VQA benchmarks, CATCH delivers consistent gains, including +2.3 BLEU on MathVQA, +2.6 VQA score on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
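
To make the routing step above concrete, here is a minimal sketch of a domain classifier selecting a domain-specific prompt template, used as a stand-in for the Prompt Adapter. The domain label set, the classifier architecture, and the templates are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed label set; the abstract mentions remote sensing, medical imaging, charts, and math diagrams.
DOMAINS = ["natural", "remote_sensing", "medical", "chart", "math"]

class DomainClassifier(nn.Module):
    """Hypothetical lightweight head that predicts the image domain from pooled vision features."""
    def __init__(self, feat_dim: int, num_domains: int = len(DOMAINS)):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_domains),
        )

    def forward(self, pooled_feats: torch.Tensor) -> torch.Tensor:
        return self.head(pooled_feats)

# Illustrative domain-conditioned templates standing in for the Prompt Adapter.
PROMPT_TEMPLATES = {
    "remote_sensing": "This is an aerial or satellite image. {question}",
    "medical": "This is a medical image such as a radiograph. {question}",
    "chart": "Answer using the values shown in the chart. {question}",
    "math": "Reason over the math diagram step by step. {question}",
}

def adapt_prompt(pooled_feats: torch.Tensor, question: str, clf: DomainClassifier) -> str:
    """Pick a template from the predicted domain; fall back to the raw question.

    Assumes a single pooled feature vector (batch size 1).
    """
    domain = DOMAINS[clf(pooled_feats).argmax(dim=-1).item()]
    return PROMPT_TEMPLATES.get(domain, "{question}").format(question=question)
```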

💡 Why This Paper Matters

Existing approaches to out-of-domain VQA typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and difficult to scale across tasks. CATCH addresses this by decoupling visual and linguistic adaptation into two lightweight modules that are injected through a unified hook interface, so a single backbone can serve remote sensing, medical, chart, and math-diagram VQA without retraining. The consistent gains reported across four domain-specific benchmarks suggest a practical and extensible route to deploying multi-domain VQA systems.

🎯 Why It's Interesting for AI Security Researchers

For researchers concerned with the reliability of deployed multimodal systems, CATCH touches two relevant themes. First, it directly targets the failure mode in which a VQA model's accuracy degrades sharply under distribution shift, a reliability concern when such models are applied in sensitive settings such as medical imaging. Second, its hook-based injection shows how lightweight external modules can substantially change a model's behavior at inference time without modifying backbone weights, a mechanism worth understanding both as an adaptation tool and as a component that deployed pipelines need to account for when auditing model behavior.

📚 Read the Full Paper