Prompt injections as a tool for preserving identity in GAI image descriptions

Authors: Kate Glazko, Jennifer Mankoff

Published: 2025-10-17

arXiv ID: 2510.16128v1

Added to Library: 2025-11-14 23:10 UTC

📄 Abstract

Generative AI risks such as bias and lack of representation impact people who do not interact directly with GAI systems, but whose content does: indirect users. Several approaches to mitigating harms to indirect users have been described, but most require top-down or external intervention. An emerging strategy, prompt injections, provides an empowering alternative: indirect users can mitigate harm against them, from within their own content. Our approach proposes prompt injections not as a malicious attack vector, but as a tool for content/image owner resistance. In this poster, we demonstrate one case study of prompt injections for empowering an indirect user, by retaining an image owner's gender and disabled identity when an image is described by GAI.

🔍 Key Points

  • Framing of "indirect users": people who do not interact directly with generative AI (GAI) systems but whose content does, and who are still affected by risks such as bias and lack of representation.
  • Observation that most approaches described for mitigating harms to indirect users require top-down or external intervention.
  • Proposal of prompt injections not as a malicious attack vector but as a tool for content and image owner resistance, letting indirect users mitigate harm from within their own content.
  • A single case study, presented as a poster, in which a prompt injection retains an image owner's gender and disabled identity when the image is described by GAI.

💡 Why This Paper Matters

This paper reframes prompt injection, usually discussed as an attack, as a means of resistance for people whose content is processed by GAI systems they never use directly. Because most mitigations described for indirect users depend on top-down or external intervention, an approach that works from within the content itself gives image owners a way to act on their own behalf. The case study on retaining gender and disabled identity in GAI image descriptions illustrates the idea with one concrete example.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper shows that the mechanism usually studied as a malicious attack vector can also serve content owners as a tool for resistance. It invites researchers to consider who embeds an injection and for what purpose, not only whether one is present, since the same technique can be used by an indirect user to preserve how they are described.
