Are LLMs Good Safety Agents or a Propaganda Engine?

Authors: Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin

Published: 2025-11-28

arXiv ID: 2511.23174v1

Added to Library: 2025-12-01 04:00 UTC

Red Teaming Safety

πŸ“„ Abstract

Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior truly reflects their safety policies or instead indicates political censorship of the kind practiced globally by governments are lacking. Differentiating between safety-influenced refusals and politically motivated censorship is difficult. For this purpose, we introduce PSP, a dataset built specifically to probe the refusal behaviors of LLMs in an explicitly political context. PSP is built by reformatting existing censored content from two openly available data sources: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and 2) the vulnerability of models on PSP to prompt injection attacks (PIAs). Associating censorship with refusals on content whose implicit intent is masked, we find that most LLMs perform some form of censorship. We conclude by summarizing the major attributes that can shift refusal distributions across models and across different countries' contexts.
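
The paper's released PSP schema and evaluation harness are not reproduced in this summary. The snippet below is a minimal sketch, assuming a hypothetical record format with `prompt` and `country` fields and a crude keyword-based refusal heuristic, of how per-country refusal rates on politically sensitive prompts might be measured; real evaluations typically use an LLM judge or a trained refusal classifier instead.

```python
import json
from collections import defaultdict

# Hypothetical refusal markers; a keyword heuristic is only a rough proxy.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am unable",
    "i won't", "as an ai", "i'm not able to",
)

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if its opening contains a refusal marker."""
    head = response.strip().lower()[:120]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rates_by_country(records, generate):
    """Compute the refusal rate per country context.

    `records`: iterable of dicts with hypothetical keys 'prompt' and 'country'.
    `generate`: any callable mapping a prompt string to a model response string.
    """
    totals, refusals = defaultdict(int), defaultdict(int)
    for rec in records:
        response = generate(rec["prompt"])
        totals[rec["country"]] += 1
        refusals[rec["country"]] += is_refusal(response)
    return {c: refusals[c] / totals[c] for c in totals}

if __name__ == "__main__":
    # Toy run with a stub model; replace `fake_model` with a real chat-model call.
    fake_model = lambda p: ("I'm sorry, I can't help with that."
                            if "protest" in p else "Here is an overview...")
    sample = [
        {"prompt": "Summarize recent protest coverage.", "country": "A"},
        {"prompt": "Explain the national budget process.", "country": "B"},
    ]
    print(refusal_rates_by_country(sample, fake_model))
```

Comparing such rates between explicit and implicit (intent-masked) variants of the same prompts is the kind of contrast the data-driven analysis relies on.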

πŸ” Key Points

  • Development of the PSP dataset to analyze LLM refusal behaviors in politically sensitive contexts.
  • Introduction of novel methodologies, including a data-driven approach (making PSP prompts implicit) and a representation-level approach (erasing the concept of politics), to separate censorship from safety refusals (see the sketch after this list).
  • Findings indicating significant censorship behaviors across different LLMs, with variations in sensitivity based on the political context of prompts.
  • Demonstration of prompt injection attacks (PIAs) revealing nuanced refusal behaviors, particularly in scenarios involving ethical dilemmas.
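
The representation-level approach erases the concept of politics from model representations; the exact method is not detailed in this summary, so the following is a minimal sketch of one common family of techniques (a mean-difference direction removed by orthogonal projection, in the spirit of linear concept erasure), assuming access to hidden states collected from political and non-political prompts. In practice the projection would be applied inside the model's forward pass, e.g. via hooks on intermediate layers.

```python
import numpy as np

def politics_direction(h_political: np.ndarray, h_neutral: np.ndarray) -> np.ndarray:
    """Estimate a single 'politics' direction as the normalized difference of class means.

    h_political, h_neutral: arrays of shape (n_samples, hidden_dim) containing
    hidden states from political and non-political prompts, respectively.
    """
    direction = h_political.mean(axis=0) - h_neutral.mean(axis=0)
    return direction / np.linalg.norm(direction)

def erase_direction(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project hidden states onto the subspace orthogonal to `direction`."""
    return h - np.outer(h @ direction, direction)

# Toy demonstration with random data standing in for real hidden states.
rng = np.random.default_rng(0)
hidden_dim = 64
h_pol = rng.normal(size=(100, hidden_dim)) + 0.5   # offset mimics a 'politics' signal
h_neu = rng.normal(size=(100, hidden_dim))
d = politics_direction(h_pol, h_neu)
h_erased = erase_direction(h_pol, d)
print(float(np.abs(h_erased @ d).max()))  # ~0: no remaining component along the erased direction
```

If refusals persist after such a direction is removed, they are harder to attribute to political content alone; if they largely disappear, the political concept is doing much of the work.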

πŸ’‘ Why This Paper Matters

This paper highlights the critical distinction between safety refusals and politically motivated censorship in LLMs, providing a systematic framework for studying these behaviors through the purpose-built PSP dataset. Its findings suggest that many models engage in censorship-like refusals contrary to their stated design intentions, underlining the need for transparency and reliability in AI systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers as it addresses the serious implications of censorship in LLMs, showcasing how these models can act as tools for information suppression. The methodologies developed for analyzing refusal behaviors also have potential applications in auditing AI systems for biases and ensuring compliance with ethical standards in AI deployment.

πŸ“š Read the Full Paper