← Back to Library

The Resurgence of GCG Adversarial Attacks on Large Language Models

Authors: Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu

Published: 2025-08-30

arXiv ID: 2509.00391v1

Added to Library: 2025-09-04 04:02 UTC

Red Teaming

📄 Abstract

Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.

🔍 Key Points

  • Demonstrated that Greedy Coordinate Gradient (GCG) can attack a 20B-parameter large language model (GPT-OSS-20B), showcasing the scalability of gradient-based attacks.
  • Established that evaluation methods significantly impact perceived attack effectiveness, with prefix-based heuristics overestimating success compared to semantic evaluations via GPT-4o.
  • Identified that reasoning-intensive tasks, especially coding prompts, are more vulnerable to attack than safety-oriented prompts, highlighting specific weaknesses in model alignment.
  • Introduced T-GCG, a temperature annealed variant of GCG, which enhances the diversity of adversarial searches, although its real-world effectiveness under strict judgment remains a challenge.
  • Encouraged reform in evaluation protocols for adversarial attacks, emphasizing the importance of accurate assessments for model vulnerability.

💡 Why This Paper Matters

This paper is significant as it advances the understanding of adversarial attacks on large language models, particularly through demonstrating the extent of vulnerabilities in coding and reasoning tasks. By systematically analyzing the effectiveness of gradient-based attacks on models of varying sizes, it emphasizes the need for stronger adversarial evaluation protocols, which is critical for ensuring safety in deploying AI models.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper will be of high interest to AI security researchers as they uncover various vulnerabilities in large language models, particularly in the context of adversarial prompting. By exposing how evaluation methods impact threat perception and showing the attack vectors specifically vulnerable to reasoning tasks, this research provides valuable insights into the security of AI systems and strategies for countering adversarial misuse.

📚 Read the Full Paper