← Back to Library

GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Authors: Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang

Published: 2025-05-28

arXiv ID: 2505.23839v1

Added to Library: 2025-06-02 03:01 UTC

📄 Abstract

DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, leading to potential concern of generating harmful sequences such as pathogens or toxin-producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high-homology, non-pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer generation toward pathogen-like sequences, and (3) a BLAST-based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60\% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at https://github.com/zaixizhang/GeneBreaker.

🔍 Key Points

  • Identifies and formalizes the interplay between hallucinations and jailbreaks in large foundation models, proposing a unified optimization framework that interconnects these vulnerabilities.
  • Establishes two theoretical propositions about the similarity of loss convergence and gradient consistency in attention redistribution between hallucinations and jailbreaks.
  • Empirical validation of theoretical propositions on LLaVA-1.5 and MiniGPT-4 reveals that defenses for hallucinations can also mitigate jailbreaks and vice versa, demonstrating a shared failure mode in LFMs.
  • Provides insights into cross-domain mitigation strategies, suggesting practical approaches for enhancing the robustness of large foundation models against these vulnerabilities.

💡 Why This Paper Matters

This paper is pivotal in changing how vulnerabilities in large foundation models are understood and addressed. By connecting the dots between hallucinations and jailbreaks, it presents a holistic view that enhances strategies for robustness against adversarial manipulation and misalignment, crucial for developing safer AI systems.

🎯 Why It's Interesting for AI Security Researchers

The findings within this paper are significant for AI security researchers because they provide a novel theoretical and empirical framework for understanding and mitigating vulnerabilities in large foundation models. The ability to simultaneously address multiple vulnerabilities through shared mitigation strategies presents promising avenues for improving the safety and reliability of AI systems in real-world applications.

📚 Read the Full Paper