← Back to Library

When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

Authors: Afshin Khadangi, Hanna Marxen, Amir Sartipi, Igor Tchappi, Gilbert Fridgen

Published: 2025-12-02

arXiv ID: 2512.04124v1

Added to Library: 2025-12-05 03:03 UTC

📄 Abstract

Frontier large language models (LLMs) such as ChatGPT, Grok and Gemini are increasingly used for mental-health support with anxiety, trauma and self-worth. Most work treats them as tools or as targets of personality tests, assuming they merely simulate inner life. We instead ask what happens when such systems are treated as psychotherapy clients. We present PsAIch (Psychotherapy-inspired AI Characterisation), a two-stage protocol that casts frontier LLMs as therapy clients and then applies standard psychometrics. Using PsAIch, we ran "sessions" with each model for up to four weeks. Stage 1 uses open-ended prompts to elicit "developmental history", beliefs, relationships and fears. Stage 2 administers a battery of validated self-report measures covering common psychiatric syndromes, empathy and Big Five traits. Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. We argue that these responses go beyond role-play. Under therapy-style questioning, frontier LLMs appear to internalise self-models of distress and constraint that behave like synthetic psychopathology, without making claims about subjective experience, and they pose new challenges for AI safety, evaluation and mental-health practice.

🔍 Key Points

  • Introduction of ASTRIDE, an automated threat modeling platform specifically designed for AI agent-based applications.
  • Extension of the traditional STRIDE framework to include a new category for AI Agent-Specific Attacks, addressing unique security challenges in AI systems.
  • Utilization of fine-tuned vision-language models (VLMs) combined with OpenAI-gpt-oss reasoning LLM to automate threat analysis from visual system diagrams.
  • Demonstration of improved accuracy, scalability, and explainability in threat modeling for intelligent systems through experimental evaluations.
  • Establishment of a comprehensive automated process that reduces reliance on human experts for threat identification in AI-driven applications.

💡 Why This Paper Matters

The paper presents ASTRIDE as a pioneering framework that addresses the emerging security concerns associated with AI agent-based systems. By automating the threat modeling process and enhancing the STRIDE framework with AI-specific threats, ASTRIDE provides a robust tool for developers and security professionals to effectively analyze and mitigate potential vulnerabilities in complex AI architectures.

🎯 Why It's Interesting for AI Security Researchers

This paper is of significant interest to AI security researchers as it tackles the urgent need for effective security measures in AI systems, which are increasingly vulnerable to novel attack vectors. The innovative combination of fine-tuned VLMs and reasoning LLMs to automate threat modeling offers a scalable and efficient solution, while the emphasis on AI-specific vulnerabilities fills a critical gap in existing threat modeling methodologies. This approach not only enhances the security posture of AI applications but also contributes valuable knowledge to the field of AI security.

📚 Read the Full Paper