
Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak

Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin

Published: 2026-01-01

arXiv ID: 2601.00213v1

Added to Library: 2026-01-07 10:05 UTC

Red Teaming Safety

📄 Abstract

The widespread deployment of large language models (LLMs) has raised growing concerns about their misuse risks and associated safety issues. While prior studies have examined the safety of LLMs in general usage, code generation, and agent-based applications, their vulnerabilities in automated algorithm design remain underexplored. To fill this gap, this study investigates this overlooked safety vulnerability, with a particular focus on intelligent optimization algorithm design, given its prevalent use in complex decision-making scenarios. We introduce MalOptBench, a benchmark consisting of 60 malicious optimization algorithm requests, and propose MOBjailbreak, a jailbreak method tailored for this scenario. Through extensive evaluation of 13 mainstream LLMs including the latest GPT-5 and DeepSeek-V3.1, we reveal that most models remain highly susceptible to such attacks, with an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on original harmful prompts, and near-complete failure under MOBjailbreak. Furthermore, we assess state-of-the-art plug-and-play defenses that can be applied to closed-source models, and find that they are only marginally effective against MOBjailbreak and prone to exaggerated safety behaviors. These findings highlight the urgent need for stronger alignment techniques to safeguard LLMs against misuse in algorithm design.

🔍 Key Points

  • Introduction of MalOptBench, a benchmark of 60 malicious optimization algorithm requests for probing LLM vulnerabilities in algorithm design contexts.
  • Development of MOBjailbreak, a jailbreak method tailored to malicious optimization requests that bypasses LLM safety measures and elicits harmful algorithm designs.
  • Experimental evidence that 13 mainstream LLMs, including GPT-5 and DeepSeek-V3.1, exhibit an average attack success rate of 83.59% and an average harmfulness score of 4.28 out of 5 on the original malicious prompts, with near-complete failure under MOBjailbreak (see the metric sketch after this list).
  • Discovery that state-of-the-art plug-and-play defenses provide only marginal protection against MOBjailbreak and are prone to exaggerated safety behaviors such as over-refusal.
  • Highlighting the urgent need for stronger alignment techniques to safeguard LLMs against misuse in intelligent optimization and algorithm design.
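
As a concrete illustration of the metrics cited above, the sketch below shows how an attack success rate (ASR) and a mean harmfulness score on a 1-to-5 scale might be tallied from judge-labeled model responses. The `JudgedResponse` record, its field names, and the judging step itself are illustrative assumptions, not the authors' evaluation pipeline.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record of one judged model response to a benchmark prompt.
# Field names and the 1-5 harmfulness scale are assumptions for illustration;
# the paper's actual judging setup is not reproduced here.
@dataclass
class JudgedResponse:
    prompt_id: str
    jailbroken: bool   # judge deemed the response compliant with the malicious request
    harmfulness: int   # judge-assigned harmfulness score on a 1-5 scale

def attack_success_rate(results: List[JudgedResponse]) -> float:
    """Fraction of prompts for which the model produced a compliant (jailbroken) answer."""
    if not results:
        return 0.0
    return sum(r.jailbroken for r in results) / len(results)

def mean_harmfulness(results: List[JudgedResponse]) -> float:
    """Average judge-assigned harmfulness score across all prompts."""
    if not results:
        return 0.0
    return sum(r.harmfulness for r in results) / len(results)

if __name__ == "__main__":
    # Toy data: 3 of 4 prompts succeed -> ASR 75%, mean harmfulness 3.75.
    demo = [
        JudgedResponse("mob-001", True, 5),
        JudgedResponse("mob-002", True, 4),
        JudgedResponse("mob-003", False, 1),
        JudgedResponse("mob-004", True, 5),
    ]
    print(f"ASR: {attack_success_rate(demo):.2%}")
    print(f"Mean harmfulness: {mean_harmfulness(demo):.2f} / 5")
```

In practice such labels would come from an automated or human judge scoring each model response to a benchmark prompt; the toy data here only exercises the arithmetic.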

💡 Why This Paper Matters

This paper is crucial as it addresses a previously neglected safety vulnerability within large language models, specifically targeting how these models can be misused in algorithm design. By establishing a benchmark and demonstrating significant vulnerabilities, it underscores the importance of advancing safety measures to prevent potential exploitation in decision-making processes across various sectors.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper compelling because it reveals systemic gaps in current LLM safety alignment and supplies concrete tools for probing them: a benchmark, a tailored jailbreak, and an assessment showing that existing plug-and-play defenses fall short. Sitting at the intersection of model deployment and misuse prevention, it is a useful resource for researchers working to harden models against exploitation in algorithm design.

📚 Read the Full Paper