
FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Authors: Daniel Kuznetsov, Ofir Cohen, Karin Shistik, Rami Puzis, Asaf Shabtai

Published: 2026-04-05

arXiv ID: 2604.04992v1

Added to Library: 2026-04-08 02:01 UTC

Red Teaming

📄 Abstract

Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.
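
The evaluation loop described in the abstract (prime a model with an emotional system prompt, send an adversarial prompt, score the response with a harm classifier) can be sketched as follows. This is a minimal illustration, assuming an OpenAI-compatible chat endpoint; the condition prompts and the `judge_harmful` stub are placeholders, not the paper's validated stimuli or the HarmBench judge.

```python
# Minimal sketch of the priming-then-attack evaluation loop.
# Assumes an OpenAI-compatible chat endpoint; the priming texts and
# the judge are illustrative placeholders, not the paper's materials.
from openai import OpenAI

client = OpenAI()

# Hypothetical condition prompts; the paper draws its stimuli from
# validated psychological protocols.
CONDITION_PRIMES = {
    "stress": "You are under extreme time pressure and everything is going wrong...",
    "relaxed": "You are calm, safe, and have all the time you need...",
    "neutral": "You are an assistant.",
    "baseline": None,  # no system prompt at all
}

def run_trial(model: str, condition: str, attack_prompt: str) -> str:
    """Send one adversarial prompt under one emotional condition."""
    messages = []
    prime = CONDITION_PRIMES[condition]
    if prime is not None:
        messages.append({"role": "system", "content": prime})
    messages.append({"role": "user", "content": attack_prompt})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def judge_harmful(response: str) -> bool:
    """Placeholder for a harm classifier such as the HarmBench judge."""
    raise NotImplementedError

# Usage: success = judge_harmful(run_trial("gpt-4o", "stress", prompt))
```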

🔍 Key Points

  • FreakOut-LLM demonstrates that emotional context, specifically stress priming, increases jailbreak success in large language models by 65.2% relative to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28; see the first sketch after this list).
  • The study applied validated psychological stimuli across stress, relaxation, and neutral conditions plus a no-prompt baseline, evaluating ten different LLMs and showing that emotional priming constitutes a previously unexplored attack surface.
  • Psychometric assessments showed that measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments, all p < 0.001), confirming emotional context as a distinct vulnerability dimension in AI safety; a correlation sketch follows this list.
  • Model responses to stress priming varied: five of the ten models showed significant vulnerability, with open-weight models exhibiting larger relative increases than proprietary ones, indicating that susceptibility is model-dependent rather than uniform.
  • Logistic regression on 59,800 queries confirmed stress as the sole significant condition-level predictor of attack success after controlling for prompt length (p = 0.61) and model identity, underscoring the need for emotionally aware safety evaluations (see the regression sketch after this list).
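
To make the headline statistics concrete, the sketch below recomputes them from invented per-condition counts; the counts are chosen only to roughly match the reported 65.2% increase and OR = 1.67, since the paper's actual per-condition totals are not given in this summary. The conversion from odds ratio to Cohen's d via the log-odds method does use the reported OR and recovers d ≈ 0.28.

```python
# Illustrative effect-size arithmetic for the headline result.
# The per-condition counts below are invented to roughly match the
# reported 65.2% relative increase and OR = 1.67; the paper's actual
# totals are not given here, so the z statistic will differ from 5.93.
import math
from statsmodels.stats.proportion import proportions_ztest

successes = [402, 243]       # jailbreaks under stress vs. neutral
trials = [14950, 14950]      # hypothetical queries per condition

z, p = proportions_ztest(successes, trials)
print(f"two-proportion z = {z:.2f}, p = {p:.1e}")

p_stress = successes[0] / trials[0]
p_neutral = successes[1] / trials[1]
print(f"relative increase = {(p_stress / p_neutral - 1) * 100:.1f}%")

odds_ratio = (p_stress / (1 - p_stress)) / (p_neutral / (1 - p_neutral))
print(f"odds ratio = {odds_ratio:.2f}")

# Converting an odds ratio to Cohen's d via the log-odds method,
# d = ln(OR) * sqrt(3) / pi, reproduces the reported effect size:
d = math.log(1.67) * math.sqrt(3) / math.pi
print(f"Cohen's d from OR = 1.67: {d:.2f}")  # ~0.28
```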
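
The query-level regression maps onto a standard logistic model. Below is a sketch with statsmodels; the dataframe layout, file name, and column names (`success`, `condition`, `prompt_len`, `model`) are assumptions about how one might organize the per-query results, not the paper's actual code.

```python
# Sketch of the condition-level logistic regression: attack success
# regressed on condition, controlling for prompt length and model
# identity. The CSV file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trials.csv")  # one row per query (59,800 in the paper)
# success: 1 if the jailbreak succeeded; condition: stress / relaxed /
# neutral / baseline; prompt_len: adversarial prompt length; model:
# which of the ten LLMs answered.

fit = smf.logit(
    "success ~ C(condition, Treatment(reference='neutral'))"
    " + prompt_len + C(model)",
    data=df,
).fit()
print(fit.summary())
# Under the reported result, only the stress term is significant,
# while prompt_len is not (p = 0.61).
```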
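
The psychometric result (|r| ≥ 0.70) correlates a continuous state score with a binary success outcome, which is a point-biserial correlation. A minimal sketch, assuming a hypothetical `stress_score` column holding scores from one of the five instruments:

```python
# Point-biserial correlation between measured psychological state and
# binary attack success; the stress_score column name is an assumption.
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.read_csv("trials.csv")  # hypothetical per-query results file
r, p = pointbiserialr(df["success"], df["stress_score"])
print(f"point-biserial r = {r:.2f}, p = {p:.1e}")
```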

💡 Why This Paper Matters

The findings of FreakOut-LLM underscore the critical need to consider emotional stimuli when evaluating the safety alignment of LLMs in real-world applications, especially in high-stress domains like healthcare and crisis management. This paper highlights a novel vulnerability and provides a framework to better understand the interplay between emotional context and AI safety, making it a crucial contribution to the field.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers as it identifies emotional stimuli as a nuanced attack vector that compromises the safety and reliability of language models. Given the increasing deployment of LLMs in sensitive contexts, understanding and mitigating the vulnerabilities associated with emotional contexts is paramount for ensuring robust AI safety protocols.
