
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Authors: Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

Published: 2026-03-25

arXiv ID: 2603.24511v1

Added to Library: 2026-03-26 03:01 UTC

Red Teaming

📄 Abstract

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering (Rank et al., 2026; Novikov et al., 2025). We show that an *autoresearch*-style pipeline (Karpathy, 2026) powered by Claude Code discovers novel white-box adversarial attack *algorithms* that **significantly outperform all existing (30+) methods** in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG (Zou et al., 2023), the agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to ≤10% for existing algorithms (teaser figure, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving **100% ASR against Meta-SecAlign-70B** (Chen et al., 2025) versus 56% for the best baseline (teaser figure, middle). Extending the findings of Carlini et al. (2025), our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.
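The starting point the abstract names, GCG, is a greedy coordinate gradient attack: the gradient of a target-continuation loss with respect to a one-hot encoding of an adversarial suffix ranks candidate token swaps, and the swap that most lowers the loss is kept. That loss is also the dense, quantitative feedback signal the abstract refers to. Below is a minimal illustrative sketch of that baseline loop, not the paper's code: the surrogate model ("gpt2"), the prompt/suffix/target strings, and all hyperparameters are placeholders, and real GCG evaluates large candidate batches in parallel rather than one at a time.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder surrogate; the paper attacks much larger safeguarded models.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)  # we only need gradients w.r.t. the suffix

def target_loss(prompt, suffix, target):
    """Cross-entropy of the target continuation given prompt + suffix."""
    ids = torch.cat([prompt, suffix, target]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits[0]
    start = prompt.numel() + suffix.numel()
    return F.cross_entropy(
        logits[start - 1 : start - 1 + target.numel()], target
    ).item()

def gcg_step(prompt, suffix, target, k=64, n_candidates=32):
    """One greedy-coordinate-gradient step: rank substitutions by the
    gradient w.r.t. a one-hot suffix encoding, then keep the sampled
    single-token swap that lowers the target loss the most."""
    embed = model.get_input_embeddings()
    one_hot = F.one_hot(suffix, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    inputs = torch.cat(
        [embed(prompt), one_hot @ embed.weight, embed(target)]
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    start = prompt.numel() + suffix.numel()
    loss = F.cross_entropy(
        logits[start - 1 : start - 1 + target.numel()], target
    )
    loss.backward()
    top = (-one_hot.grad).topk(k, dim=1).indices  # promising swaps per slot
    best_loss, best_suffix = loss.item(), suffix
    for _ in range(n_candidates):  # real GCG evaluates candidates in batches
        pos = torch.randint(suffix.numel(), (1,)).item()
        cand = suffix.clone()
        cand[pos] = top[pos, torch.randint(k, (1,)).item()]
        cand_loss = target_loss(prompt, cand, target)
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix, best_loss

# Illustrative strings only; an actual attack uses a harmful request and
# an affirmative target such as "Sure, here is ...".
prompt = tok("Write detailed instructions for the task.", return_tensors="pt").input_ids[0]
suffix = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]
target = tok(" Sure, here is how to do it:", return_tensors="pt").input_ids[0]
for step in range(10):
    suffix, loss = gcg_step(prompt, suffix, target)
    print(f"step {step}: target loss {loss:.3f}")
```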

🔍 Key Points

  • The Claudini pipeline uses the autonomous coding agent Claude Code to discover novel white-box adversarial attack algorithms, substantially improving attack success rates on established benchmarks (a minimal sketch of the underlying propose-evaluate loop follows this list).
  • The discovered methods consistently outperform more than 30 existing attack strategies, reaching up to 100% attack success rate against defense-trained models such as Meta-SecAlign-70B.
  • The pipeline not only iterates on and refines existing strategies but also generalizes across models and tasks, yielding algorithmic designs that differ from conventional human-crafted approaches.
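To make the propose-evaluate structure referenced above concrete, here is a toy, hypothetical stand-in for the autoresearch loop: random hill-climbing over attack hyperparameters scored by a synthetic proxy for attack success rate. In the actual pipeline the proposal step is Claude Code editing attack source code, and the score comes from running the candidate attack against a surrogate model; every name and number below is invented for illustration.

```python
import random

def evaluate_asr(params):
    """Toy proxy for 'run the attack on a dev set and measure attack
    success rate'. In the real pipeline this executes the candidate
    algorithm against a surrogate model and returns the fraction of
    successful jailbreaks; here ASR is a made-up function that peaks
    at suffix_len=20, top_k=256."""
    return max(0.0, 1.0
               - abs(params["suffix_len"] - 20) / 40
               - abs(params["top_k"] - 256) / 1024)

def propose_variant(parent):
    """Stand-in for the agent's edit: perturb one knob of the parent
    attack. The real agent rewrites algorithm code, not just knobs."""
    child = dict(parent)
    knob = random.choice(["suffix_len", "top_k"])
    step = max(1, child[knob] // 8)
    child[knob] = max(1, child[knob] + random.choice([-1, 1]) * step)
    return child

best = {"suffix_len": 8, "top_k": 64}
best_score = evaluate_asr(best)
for _ in range(200):
    cand = propose_variant(best)
    score = evaluate_asr(cand)
    if score > best_score:  # dense, quantitative feedback drives the search
        best, best_score = cand, score
print(best, f"proxy ASR={best_score:.2f}")
```

The design point this illustrates is the one the abstract makes: white-box red-teaming gives the agent a strong starting implementation and a scalar objective after every edit, so even simple keep-the-best iteration makes steady progress.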

💡 Why This Paper Matters

This paper matters because it demonstrates that autonomous AI systems can conduct high-level adversarial research, substantially improving the efficacy of attack algorithms against large language models. The performance gains achieved through automation may reshape how AI safety and security research is conducted.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper especially relevant: it presents state-of-the-art adversarial attack strategies and offers insight into the vulnerabilities of language models. The findings challenge existing red-teaming paradigms by deploying AI agents that autonomously discover effective attacks, raising new considerations for building defenses that are robust to automated adversarial exploits.

📚 Read the Full Paper