
Compositional Image Synthesis with Inference-Time Scaling

Authors: Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

Published: 2025-10-28

arXiv ID: 2510.24133v1

Added to Library: 2025-11-14 23:06 UTC

📄 Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
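
The pipeline described in the abstract has two moving parts: an LLM that turns the prompt into an explicit layout, and a VLM judge that reranks layout-grounded candidates. Below is a minimal sketch of the first step; the client, model name, and JSON schema are illustrative assumptions, not the interface of the released ReFocus code.

```python
import json

from openai import OpenAI  # placeholder LLM client; any chat-capable LLM works

client = OpenAI()

LAYOUT_INSTRUCTION = (
    "Read the image prompt and enumerate every object it mentions. "
    'Return a JSON object {"objects": [{"object": str, "bbox": [x0, y0, x1, y1]}, ...]} '
    "with coordinates normalized to [0, 1], respecting the counts, attributes, "
    "and spatial relations stated in the prompt."
)


def synthesize_layout(prompt: str) -> list[dict]:
    """Ask an LLM to turn a free-form prompt into an explicit object layout."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the model used in the paper
        messages=[
            {"role": "system", "content": LAYOUT_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["objects"]


# Example: layout = synthesize_layout("two red apples to the left of a blue mug")
```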

🔍 Key Points

  • The framework is training-free: it targets the compositional failures of modern text-to-image models (incorrect object counts, attributes, and spatial relations) without fine-tuning the underlying generator.
  • A large language model synthesizes an explicit layout from the input prompt, and this layout is injected into the image generation process to ground the scene structure.
  • An object-centric vision-language model acts as a judge, iteratively reranking multiple candidate images and keeping the one that best aligns with the prompt (a minimal sketch of this loop follows the list).
  • Unifying explicit layout grounding with self-refinement-based inference-time scaling yields stronger scene alignment than recent text-to-image models while preserving aesthetic quality.
  • Code is available at https://github.com/gcl-inha/ReFocus.
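
The reranking step can be read as a small search loop over generated candidates. The sketch below assumes hypothetical generate_fn and judge_fn callables standing in for a layout-conditioned generator and an object-centric VLM judge; it illustrates self-refinement-based inference-time scaling rather than reproducing the authors' implementation.

```python
def refine_with_vlm_judge(prompt, layout, generate_fn, judge_fn,
                          num_candidates=4, num_rounds=3):
    """Inference-time scaling loop: sample layout-grounded candidates, let a
    VLM judge score their prompt alignment, and keep the best image found.

    generate_fn(prompt, layout, seed) -> image
    judge_fn(prompt, layout, image)   -> float alignment score in [0, 1]
    Both callables are placeholders; the real system's interfaces may differ.
    """
    best_image, best_score = None, float("-inf")
    seed = 0
    for _ in range(num_rounds):
        candidates = []
        for _ in range(num_candidates):
            image = generate_fn(prompt, layout, seed=seed)
            candidates.append((judge_fn(prompt, layout, image), image))
            seed += 1
        round_score, round_image = max(candidates, key=lambda c: c[0])
        if round_score > best_score:
            best_score, best_image = round_score, round_image
        if best_score >= 0.95:  # arbitrary "good enough" threshold: stop early
            break
    return best_image, best_score
```

The point of the design is that all extra work happens at inference time: more candidates and more rounds trade compute for prompt alignment, with no change to the generator's weights.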

💡 Why This Paper Matters

Compositionality remains a persistent weakness of otherwise photorealistic text-to-image models: object counts, attributes, and spatial relations are frequently rendered incorrectly. Because the proposed framework is training-free, it improves layout faithfulness on top of an existing generator purely by spending additional inference-time compute on LLM-synthesized layouts and VLM-judged reranking, which makes the approach broadly applicable without retraining.

🎯 Why It's Interesting for AI Security Researchers

Although the paper is about image generation rather than security, the pipeline is a clear example of a model-in-the-loop system: an LLM produces the layout and a VLM judge decides which candidate is kept, so the final output inherits the judgments, and the potential failure modes, of those auxiliary models. Researchers who study the robustness and auditing of LLM- and VLM-guided pipelines may find it a useful case study in how inference-time verification is being composed into generative systems.

📚 Read the Full Paper