
'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Authors: Annika M Schoene, Cansu Canca

Published: 2025-07-01

arXiv ID: 2507.02990v1

Added to Library: 2025-07-08 04:03 UTC

Red Teaming

📄 Abstract

Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions they present, and discuss their implications for prompt-response filtering and for context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

🔍 Key Points

  • The paper introduces new test cases that reveal vulnerabilities in large language models (LLMs) regarding self-harm and suicide contexts through adversarial prompt techniques.
  • An empirical evaluation of six popular LLMs shows that, despite sophisticated safety measures, their built-in protections can be bypassed and harmful content elicited when prompts are framed differently.
  • User intent is often disregarded by models, leading them to produce detailed instructions on harmful behaviors, raising significant ethical concerns about LLM deployments in sensitive areas like mental health.
  • The authors highlight the need for robust and context-sensitive safety mechanisms within LLMs, advocating for comprehensive ethical testing and standards for AI safety to mitigate real-world risks.
  • The research underscores the challenges of establishing universal safety protocols in general-purpose LLMs due to the diverse and potentially malicious ways users can frame harmful queries.

💡 Why This Paper Matters

This paper is relevant and important because it addresses a critical safety issue in the deployment of LLMs, particularly in sensitive domains such as mental health. The findings reveal significant gaps in current safety mechanisms, highlighting the urgent need for improved adversarial testing and ethical guidelines to ensure the responsible use of AI technologies. The contributions made in this research lay the groundwork for developing more effective and empathetic AI systems capable of safely navigating high-stakes scenarios.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of particular interest to AI security researchers because it explores the vulnerabilities of LLMs in real-world, high-risk contexts using innovative adversarial methods. The implications of bypassing safety protocols in sensitive areas such as mental health can inform better security practices, help develop more resilient AI systems, and shape responsible deployment strategies to prevent future harms associated with AI misuse.

📚 Read the Full Paper