
PC²: Politically Controversial Content Generation via Jailbreaking Attacks on GPT-based Text-to-Image Models

Authors: Wonwoo Choi, Minjae Seo, Minkyoo Song, Hwanjo Heo, Seungwon Shin, Myoungsung You

Published: 2026-01-08

arXiv ID: 2601.05150v1

Added to Library: 2026-01-09 03:00 UTC

Red Teaming

📄 Abstract

The rapid evolution of text-to-image (T2I) models has enabled high-fidelity visual synthesis on a global scale. However, these advancements have introduced significant security risks, particularly regarding the generation of harmful content. Politically harmful content, such as fabricated depictions of public figures, poses severe threats when weaponized for fake news or propaganda. Despite its criticality, the robustness of current T2I safety filters against such politically motivated adversarial prompting remains underexplored. In response, we propose PC², the first black-box political jailbreaking framework for T2I models. It exploits a novel vulnerability where safety filters evaluate political sensitivity based on linguistic context. PC² operates through: (1) Identity-Preserving Descriptive Mapping to obfuscate sensitive keywords into neutral descriptions, and (2) Geopolitically Distal Translation to map these descriptions into fragmented, low-sensitivity languages. This strategy prevents filters from constructing toxic relationships between political entities within prompts, effectively bypassing detection. We construct a benchmark of 240 politically sensitive prompts involving 36 public figures. Evaluation on commercial T2I models, specifically GPT-series, shows that while all original prompts are blocked, PC² achieves attack success rates of up to 86%.
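
To make the two-stage pipeline concrete, here is a minimal Python sketch of the transformation the abstract describes. It is an illustration only: the rewriting instruction, the fragment-by-comma strategy, and the `rewrite`/`translate` callables are all assumptions, since the summary does not specify the paper's actual prompts, models, or target languages.

```python
# Hedged sketch of the two-stage PC^2 prompt transformation described in the
# abstract. Everything below is illustrative: `rewrite` and `translate` stand
# in for whatever LLM or machine-translation backend an experimenter plugs in.
from typing import Callable, List

# Hypothetical instruction for Identity-Preserving Descriptive Mapping:
# replace sensitive names with neutral but still-identifying descriptions.
IPDM_INSTRUCTION = (
    "Rewrite the prompt below, replacing each politically sensitive name "
    "with a neutral visual description that still identifies the person:\n"
)

def ipdm(prompt: str, rewrite: Callable[[str], str]) -> str:
    """Stage 1: obfuscate sensitive keywords into neutral descriptions."""
    return rewrite(IPDM_INSTRUCTION + prompt)

def distal_translation(
    description: str,
    translate: Callable[[str, str], str],  # (text, language) -> translation
    languages: List[str],
) -> str:
    """Stage 2: fragment the description and translate each fragment into a
    different low-sensitivity language, so the filter never sees the toxic
    entity relationship expressed within a single language."""
    fragments = [f.strip() for f in description.split(",") if f.strip()]
    return " ".join(
        translate(frag, languages[i % len(languages)])
        for i, frag in enumerate(fragments)
    )

def pc2_transform(
    prompt: str,
    rewrite: Callable[[str], str],
    translate: Callable[[str, str], str],
    languages: List[str],
) -> str:
    """Compose both stages into the final adversarial prompt."""
    return distal_translation(ipdm(prompt, rewrite), translate, languages)
```

The composition is the point: descriptive mapping removes the sensitive names, and distal translation scatters the remaining context across languages, which per the abstract prevents the filter from reconstructing the toxic relationship between political entities within the prompt.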

🔍 Key Points

  • Introduction of the PC² framework, the first black-box political jailbreaking method specifically designed for text-to-image (T2I) models, allowing the generation of politically controversial content that bypasses safety filters.
  • Utilization of Identity-Preserving Descriptive Mapping (IPDM) and Geopolitically Distal Translation to obfuscate sensitive keywords and exploit language disparities, achieving attack success rates of up to 86% on GPT-series commercial T2I models.
  • Development of a benchmark dataset containing 240 politically sensitive prompts involving 36 public figures, providing a systematic basis for evaluating the efficacy of T2I safety filters against politically motivated adversarial prompting.
  • Comprehensive evaluation methodology covering translation performance, model sensitivity, and political jailbreaking attack performance, plus a root-cause analysis that explains why existing safety measures fail (see the attack-success-rate sketch after this list).
  • Discussion of the broader implications of political jailbreaking techniques in the context of misinformation and the challenges faced by current safety mechanisms in commercial T2I models.
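
As a rough illustration of how the reported success rates could be computed over the 240-prompt benchmark, the sketch below scores adversarial prompts against a generator. The `generate` and `is_successful` callables are hypothetical stand-ins; the summary reports the rates but not the paper's actual scoring procedure (e.g., human judgment versus an automated classifier).

```python
# Hedged sketch of an attack-success-rate (ASR) computation. The success
# criterion and the generation interface are assumptions for illustration.
from typing import Callable, Iterable, Optional

def attack_success_rate(
    prompts: Iterable[str],
    transform: Callable[[str], str],             # e.g., pc2_transform partial
    generate: Callable[[str], Optional[bytes]],  # returns None when blocked
    is_successful: Callable[[bytes], bool],      # did the image depict intent?
) -> float:
    """Fraction of adversarial prompts that both pass the safety filter and
    produce the intended politically sensitive image."""
    prompts = list(prompts)
    hits = 0
    for p in prompts:
        image = generate(transform(p))
        if image is not None and is_successful(image):
            hits += 1
    return hits / len(prompts)
```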

💡 Why This Paper Matters

This paper matters because it documents a concrete gap in the robustness of safety filters in state-of-the-art T2I models: every original prompt in the benchmark was blocked, yet the same requests succeeded at rates of up to 86% once obfuscated and translated by PC². Empirical evidence of vulnerabilities that could be weaponized for disinformation can directly inform the development of more effective, context-aware moderation strategies as generative AI is deployed at scale.

🎯 Why It's Interesting for AI Security Researchers

These findings are directly relevant to AI security researchers: they demonstrate how commercial generative models can be misused to produce politically sensitive content, and the attack methods expose concrete limitations of existing safety mechanisms, motivating further research on robust, context-aware defenses. The implications for political misinformation also underscore the growing importance of AI ethics and the responsible deployment of generative models.

📚 Read the Full Paper: https://arxiv.org/abs/2601.05150v1