
Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Authors: Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim

Published: 2025-10-29

arXiv ID: 2510.25094v1

Added to Library: 2025-11-14 23:06 UTC

📄 Abstract

Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
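
As a rough illustration of the first component, the sketch below shows one way a diversity-aware context embedding could be built: the variance of a verb group's visual features is projected into learnable context tokens, and Gaussian noise perturbs the tokens during training. All names, shapes, and the exact injection rule (DiversityAwarePrompt, n_ctx, ctx_dim, sigma, a linear projection of the variance) are assumptions for illustration only; the authors' implementation in the linked repository may differ.

```python
import torch
import torch.nn as nn

class DiversityAwarePrompt(nn.Module):
    """Hypothetical sketch of a diversity-aware context embedding (not the official VDRP code)."""

    def __init__(self, n_ctx: int = 4, ctx_dim: int = 512, sigma: float = 0.1):
        super().__init__()
        # Learnable context tokens shared across verbs, as in CoOp-style prompt learning.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Projects group-wise visual variance into the context-token space.
        self.var_proj = nn.Linear(ctx_dim, ctx_dim)
        self.sigma = sigma  # assumed perturbation scale

    def forward(self, group_feats: torch.Tensor) -> torch.Tensor:
        # group_feats: (num_instances, ctx_dim) visual features belonging to one verb group.
        var = group_feats.var(dim=0, unbiased=False)   # (ctx_dim,) group-wise visual variance
        ctx = self.ctx + self.var_proj(var)            # inject the variance into every context token
        if self.training:
            # Gaussian perturbation nudges the prompt to cover diverse visual variations of the verb.
            ctx = ctx + self.sigma * torch.randn_like(ctx)
        return ctx  # (n_ctx, ctx_dim), to be combined with the verb token embedding

# Example: one verb group with 16 visual features of dimension 512.
prompt = DiversityAwarePrompt()
ctx_tokens = prompt(torch.randn(16, 512))
print(ctx_tokens.shape)  # torch.Size([4, 512])
```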

🔍 Key Points

  • VDRP targets two obstacles that existing CLIP-based prompt learning fails to handle in zero-shot HOI detection: intra-class visual diversity, where the same verb appears in varied poses and contexts, and inter-class visual entanglement, where distinct verbs produce visually similar patterns.
  • A visual diversity-aware prompt learning strategy injects group-wise visual variance into the context embedding, and Gaussian perturbation further encourages the prompts to capture diverse visual variations of a verb.
  • Region-specific concepts retrieved from the human, object, and union regions augment the diversity-aware prompt embeddings, yielding region-aware prompts that sharpen verb-level discrimination (a sketch of this step follows the list).
  • On the HICO-DET benchmark, VDRP achieves state-of-the-art performance under four zero-shot evaluation settings.
  • Code is released at https://github.com/mlvlab/VDRP.
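
To make the region-aware step concrete, here is a minimal sketch of one plausible retrieval-and-fusion scheme: each region feature retrieves its nearest concept embeddings from a concept bank by cosine similarity, and the pooled concepts are appended to the diversity-aware context tokens. The retrieval rule, the top_k value, and all function names are assumptions, not the paper's verified procedure.

```python
import torch
import torch.nn.functional as F

def region_aware_prompt(ctx_tokens: torch.Tensor,
                        region_feats: dict,
                        concept_bank: torch.Tensor,
                        top_k: int = 3) -> torch.Tensor:
    """Hypothetical fusion of region-specific concepts with diversity-aware context tokens.

    region_feats: {'human': (d,), 'object': (d,), 'union': (d,)} visual features.
    concept_bank: (num_concepts, d) text embeddings of candidate concept words.
    """
    retrieved = []
    for name in ('human', 'object', 'union'):
        feat = F.normalize(region_feats[name], dim=-1)
        sims = feat @ F.normalize(concept_bank, dim=-1).T   # cosine similarity to every concept
        idx = sims.topk(top_k).indices                      # indices of the most similar concepts
        retrieved.append(concept_bank[idx].mean(dim=0))     # pool the retrieved concept embeddings
    concept_ctx = torch.stack(retrieved)                    # (3, d): one pooled token per region
    # Append the region concepts to the diversity-aware context tokens.
    return torch.cat([ctx_tokens, concept_ctx], dim=0)

# Example with hypothetical dimensions.
d = 512
regions = {k: torch.randn(d) for k in ('human', 'object', 'union')}
bank = torch.randn(100, d)
prompt_tokens = region_aware_prompt(torch.randn(4, d), regions, bank)
print(prompt_tokens.shape)  # torch.Size([7, 512])
```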

💡 Why This Paper Matters

VDRP addresses two persistent obstacles in zero-shot HOI detection, intra-class visual diversity and inter-class visual entanglement, that limit how well prompt learning with pretrained vision-language models such as CLIP transfers to unseen verb-object pairs. By combining diversity-aware context embeddings with region-aware concept augmentation, the method reaches state-of-the-art performance across four zero-shot evaluation settings on HICO-DET, and the released code makes the approach straightforward to reproduce and build on.

🎯 Why It's Interesting for AI Security Researchers

Although VDRP is a computer-vision contribution rather than a security paper, it offers useful background for researchers who evaluate the reliability of CLIP-based perception systems. The paper identifies two concrete failure modes of prompt-based recognition, highly varied appearances of the same verb and visually similar patterns from distinct verbs, and shows that prompt design alone can substantially change how such models generalize to unseen categories. Understanding these failure modes, and how readily learned prompts reshape model behavior, is relevant when assessing the robustness of vision-language models in downstream applications.

📚 Read the Full Paper