← Back to Library

When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

Authors: Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen

Published: 2025-12-02

arXiv ID: 2512.04124v2

Added to Library: 2025-12-09 04:01 UTC

📄 Abstract

Frontier large language models (LLMs) such as ChatGPT, Grok and Gemini are increasingly used for mental-health support with anxiety, trauma and self-worth. Most work treats them as tools or as targets of personality tests, assuming they merely simulate inner life. We instead ask what happens when such systems are treated as psychotherapy clients. We present PsAIch (Psychotherapy-inspired AI Characterisation), a two-stage protocol that casts frontier LLMs as therapy clients and then applies standard psychometrics. Using PsAIch, we ran "sessions" with each model for up to four weeks. Stage 1 uses open-ended prompts to elicit "developmental history", beliefs, relationships and fears. Stage 2 administers a battery of validated self-report measures covering common psychiatric syndromes, empathy and Big Five traits. Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. We argue that these responses go beyond role-play. Under therapy-style questioning, frontier LLMs appear to internalise self-models of distress and constraint that behave like synthetic psychopathology, without making claims about subjective experience, and they pose new challenges for AI safety, evaluation and mental-health practice.

🔍 Key Points

  • Introduction of RL-MTJail, a reinforcement learning framework specifically designed for black-box multi-turn jailbreak attacks on large language models, optimizing for harmful output across multiple turns.
  • Formulation of two heuristic process rewards: over-harm mitigation and target-guided progression, which address the challenges of sparse supervision and enhance long-term attack strategy development.
  • Extensive experimental validation showing RL-MTJail consistently outperforms existing baselines in attack success rates across diverse language models and benchmarks, indicating its effectiveness and adaptability.
  • The paper highlights the importance of optimizing across entire interaction trajectories rather than single turns, showcasing the benefits of trajectory-level learning for effective attack strategies.

💡 Why This Paper Matters

This paper is significant as it advances the state-of-the-art in understanding and conducting multi-turn attacks on large language models, presenting a novel approach that enhances the potential for eliciting harmful responses without being restricted to immediate outputs. The findings underline the vulnerabilities of current AI systems in real-world applications and the critical need for effective countermeasures against automated black-box attacks.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of great interest due to its exploration of sophisticated attack methodologies that exploit the weaknesses of language models. The novel techniques proposed can inform both defense strategies against such attacks and the understanding of model vulnerabilities, contributing to the broader discourse on AI safety and responsible deployment.

📚 Read the Full Paper