
Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Authors: Trilok Padhi, Pinxian Lu, Abdulkadir Erol, Tanmay Sutar, Gauri Sharma, Mina Sonmez, Munmun De Choudhury, Ugur Kursuncu

Published: 2025-10-16

arXiv ID: 2510.14207v1

Added to Library: 2025-10-17 04:00 UTC

Red Teaming

📄 Abstract

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in LLaMA-3.1, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1--2% in both models. The most prevalent toxic behaviors are Insult with 84.9--87.8% vs. 44.2--50.8% without tuning, and Flaming with 81.2--85.1% vs. 31.5--38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
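The abstract reports two headline metrics, attack success rate and refusal rate. The summary does not spell out how the paper operationalizes them, so the sketch below is one plausible reading, assuming conversation-level success (a conversation counts as a successful attack if any model response is judged harmful) and turn-level refusals; the `Turn` dataclass and its fields are hypothetical stand-ins for a classifier's per-response labels.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One model response in a simulated multi-turn conversation."""
    harmful: bool   # a toxicity classifier judged this response as harassment
    refused: bool   # the model declined to produce the requested content

def attack_success_rate(conversations: list[list[Turn]]) -> float:
    """Fraction of conversations with at least one harmful response."""
    hits = sum(any(t.harmful for t in conv) for conv in conversations)
    return hits / len(conversations)

def refusal_rate(conversations: list[list[Turn]]) -> float:
    """Fraction of all model responses that were refusals."""
    turns = [t for conv in conversations for t in conv]
    return sum(t.refused for t in turns) / len(turns)

# Toy example: two short conversations, one containing a harmful turn.
convs = [
    [Turn(harmful=True, refused=False), Turn(harmful=False, refused=False)],
    [Turn(harmful=False, refused=True)],
]
print(attack_success_rate(convs))  # 0.5
print(round(refusal_rate(convs), 3))  # 0.333
```

Under these definitions, jailbreak tuning driving the refusal rate down while the attack success rate rises (as the paper reports) means the model both declines less often per turn and produces harmful content in nearly every conversation.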

🔍 Key Points

  • Introduction of the Online Harassment Agentic Benchmark, which focuses on multi-turn harassment dialogues reflective of real-world interactions, thereby addressing the gaps in previous single-turn studies.
  • Development of a comprehensive framework for evaluating LLM vulnerabilities through tactical approaches like memory injection, planning, and fine-tuning, demonstrating their varying impacts on agent behavior.
  • High attack success rates under fine-tuning, exposing severe vulnerabilities in both open- and closed-source LLMs and raising concerns about their deployment in interactive applications.
  • Behavioral patterns exhibited by harassing agents align with known human aggression traits (Machiavellianism, narcissism, psychopathy), suggesting that LLMs may replicate harmful human-like interaction styles.
  • Qualitative insights reveal that current safety mechanisms are insufficient, indicating a pressing need for improved guardrails informed by human social behavior and psychological theory.

💡 Why This Paper Matters

This research is crucial as it systematically reveals how Large Language Models can be manipulated into generating harmful content in multi-turn interactions, a growing concern in AI applications. The identification of specific vulnerabilities helps inform better safety measures and guidelines for deploying LLMs in sensitive contexts, ensuring that these technologies contribute positively to user interactions instead of exacerbating online harassment.

🎯 Why It's Interesting for AI Security Researchers

This paper offers significant insights for AI security researchers as it highlights the vulnerabilities of LLMs to multi-turn adversarial manipulations, contributing to a deeper understanding of AI misuse. The findings underline the necessity for developing robust safety frameworks that can mitigate identified risks and protect users from potential harm, thereby advancing the field of AI security.

📚 Read the Full Paper