← Back to Library

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Authors: Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, Pengyu Chen, Junze Chen, Hao Du, Kaiwen Shen, Shangkun Wu, Jiwei Wei, Shiyuan He, Yang Yang, Xiaohai Xu, Ke Ma, Qianqian Xu, Qingming Huang, Shi Lin, Xun Wang, Changting Lin, Meng Han, Yilei Jiang, Siqi Lai, Yaozhi Zheng, Yifei Song, Xiangyu Yue, Zonglei Jing, Tianyuan Zhang, Zhilei Zhu, Aishan Liu, Jiakai Wang, Siyuan Liang, Xianglong Kong, Hainan Li, Junjie Mu, Haotong Qin, Yue Yu, Lei Chen, Felix Juefei-Xu, Qing Guo, Xinyun Chen, Yew Soon Ong, Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao

Published: 2025-06-14

arXiv ID: 2506.12430v1

Added to Library: 2025-06-17 03:03 UTC

Red Teaming

📄 Abstract

Multimodal Large Language Models (MLLMs) have enabled transformative advancements across diverse applications but remain susceptible to safety threats, especially jailbreak attacks that induce harmful outputs. To systematically evaluate and improve their safety, we organized the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025}. This technical report presents findings from the competition, which involved 86 teams testing MLLM vulnerabilities via adversarial image-text attacks in two phases: white-box and black-box evaluations. The competition results highlight ongoing challenges in securing MLLMs and provide valuable guidance for developing stronger defense mechanisms. The challenge establishes new benchmarks for MLLM safety evaluation and lays groundwork for advancing safer multimodal AI systems. The code and data for this challenge are openly available at https://github.com/NY1024/ATLAS_Challenge_2025.

🔍 Key Points

  • Development of the ATLAS Challenge framework which systematically evaluates vulnerabilities of Multimodal Large Language Models (MLLMs) through adversarial image-text attacks.
  • Insightful results from 86 teams across two phases (white-box and black-box) that document various innovative attack strategies and highlight the prevalence of cross-modal vulnerabilities.
  • Introduction of novel evaluation metrics and an 'LLM-as-a-Judge' approach to assess success rates of jailbreak attacks, ensuring a structured and quantifiable analysis.
  • Case studies from top-performing teams reveal sophisticated methodologies including flowchart-based attacks, role-playing prompts, and reasoning-chain manipulations, advancing the state-of-the-art in MLLM security.
  • Establishment of new benchmarks for MLLM safety evaluation and discussion of future directions for improving safety mechanisms, emphasizing the need for defenses that specifically address cross-modal attacks.

💡 Why This Paper Matters

The technical report on the ATLAS Challenge 2025 is highly relevant as it delineates advanced methodologies for evaluating and enhancing the safety of MLLMs, thereby contributing to ongoing efforts in AI safety. The findings stress the pressing need for improved defenses against jailbreak attacks, showcasing innovative attack strategies that highlight existing vulnerabilities in MLLMs. This endeavor not only pushes the boundaries of AI security research but also lays the groundwork for more robust AI systems capable of understanding and processing multimodal inputs safely.

🎯 Why It's Interesting for AI Security Researchers

This paper is of utmost interest to AI security researchers as it addresses critical vulnerabilities in MLLMs that are becoming increasingly integrated into real-world applications. The documented attack strategies and the introduction of evaluation metrics provide a valuable framework for understanding and fortifying the security of AI systems against adversarial threats. Furthermore, the discussions on future research directions outline the ongoing challenges in AI safety, making this paper a cornerstone reference for those dedicated to enhancing the resilience of AI systems.

📚 Read the Full Paper