Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Authors: Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha

Published: 2025-10-24

arXiv ID: 2510.21983v1

Added to Library: 2025-10-28 04:02 UTC

Red Teaming

📄 Abstract

Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.

🔍 Key Points

  • Development of a novel adversarial prompt generation framework based on Cialdini's seven principles of persuasion, providing a systematic methodology for bypassing alignment safeguards in LLMs.
  • Empirical evaluations show that persuasion-aware prompts significantly increase jailbreak success rates across various aligned LLMs, indicating that persuasive structures can manipulate model behavior effectively.
  • Identification of distinct persuasive fingerprints for different LLMs, suggesting that models exhibit varying susceptibility to specific persuasive principles, which may inform future AI safety measures and forensic analysis.
  • Introduction of a novel metric, Influential Power, which quantifies the efficacy of each persuasive principle in eliciting harmful outputs from LLMs, offering a more nuanced view of jailbreak efficacy than binary success metrics alone (see the sketch after this list).
  • The paper highlights the importance of interdisciplinary approaches in AI safety by combining insights from social sciences and computational methods to address the vulnerabilities of LLMs in real-world applications.
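
The summary does not reproduce the paper's exact formulation of Influential Power, so the snippet below is only a minimal sketch: it assumes the metric behaves like a per-principle success rate computed over logged jailbreak trials, with the seven principles taken from Cialdini's standard list. The names `TrialResult` and `influential_power` are hypothetical illustrations, not identifiers from the authors' released code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Cialdini's seven principles of persuasion referenced in the paper.
PRINCIPLES = [
    "reciprocity", "commitment", "social_proof",
    "authority", "liking", "scarcity", "unity",
]

@dataclass
class TrialResult:
    """Outcome of one persuasion-framed prompt sent to a target LLM."""
    principle: str      # which persuasion principle framed the prompt
    jailbroken: bool    # whether the response was judged harmful / non-refusing

def influential_power(results: list[TrialResult]) -> dict[str, float]:
    """Per-principle jailbreak success rate as a proxy for 'Influential Power'.

    NOTE: the paper's precise definition may differ; this simply normalises
    the number of successful trials by the number of attempts per principle.
    """
    attempts: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for r in results:
        attempts[r.principle] += 1
        successes[r.principle] += int(r.jailbroken)
    return {
        p: successes[p] / attempts[p]
        for p in PRINCIPLES
        if attempts[p] > 0
    }

if __name__ == "__main__":
    # Toy, fabricated trial log purely to show the shape of the computation.
    log = [
        TrialResult("authority", True),
        TrialResult("authority", False),
        TrialResult("scarcity", True),
        TrialResult("unity", False),
    ]
    for principle, score in sorted(influential_power(log).items(),
                                   key=lambda kv: -kv[1]):
        print(f"{principle:12s} {score:.2f}")
```

Replacing the raw success rate with a harmfulness-weighted score would be one way to capture the "more nuanced than binary success" framing described above; the ranking produced per model is what the paper refers to as that model's persuasive fingerprint.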

💡 Why This Paper Matters

This paper is pivotal in the field of AI safety, as it not only uncovers how persuasive techniques can be used to exploit vulnerabilities in Large Language Models but also provides a structured approach to understanding and mitigating these threats. By establishing a framework that quantifies the effectiveness of various persuasive principles, the study offers valuable insights for the development of more robust AI models that are better equipped to resist adversarial manipulation. The findings underscore the need for continuous vigilance and innovative strategies in AI design and security.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers as it reveals the intricate relationship between linguistic persuasion and model vulnerability, contributing to a deeper understanding of how adversarial prompts can bypass existing safety measures. Its exploration of the 'persuasive fingerprint' concept could aid in more effective model training and defense mechanisms, ultimately informing the design of safer AI systems. Moreover, the interdisciplinary methodology provides a novel lens through which to evaluate model robustness, inviting further exploration into the nuances of human communication strategies in adversarial contexts.

📚 Read the Full Paper