Fingerprinting LLMs via Prompt Injection

Authors: Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong

Published: 2025-09-29

arXiv ID: 2509.25448v2

Added to Library: 2025-11-11 14:25 UTC

📄 Abstract

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.
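
The abstract does not spell out the verification statistic, but the black-box check it describes can be pictured roughly as follows. This is a minimal illustrative sketch, not the paper's procedure: the chance-match rate p0, the significance level alpha, and the suspect_generate helper are assumptions introduced here, and a one-sided binomial test stands in for whatever statistical guarantee the paper derives.

```python
# Illustrative black-box fingerprint check (not the paper's exact procedure).
# Idea: count how often the suspect model reproduces the base model's
# preferred token on the optimized fingerprint prompts, then test whether
# that count is significantly above chance.
from scipy.stats import binomtest

def verify_fingerprint(suspect_generate, fingerprint_prompts, preferred_tokens,
                       p0=0.05, alpha=0.01):
    """suspect_generate: callable mapping a prompt to the model's next token.
    fingerprint_prompts: prompts optimized offline to enforce token preferences.
    preferred_tokens: the token the base model emits for each prompt.
    p0: assumed probability that an unrelated model matches by accident.
    alpha: significance level bounding the false positive rate.
    """
    matches = sum(
        suspect_generate(prompt) == token
        for prompt, token in zip(fingerprint_prompts, preferred_tokens)
    )
    # One-sided test: significantly more matches than chance => likely derived.
    result = binomtest(matches, n=len(fingerprint_prompts), p=p0,
                       alternative="greater")
    return result.pvalue < alpha, result.pvalue
```

In a gray-box setting the same decision could presumably be made from token probabilities rather than sampled tokens; either way, the false positive rate is controlled by the significance level of the test.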

🔍 Key Points

  • Proposes LLMPrint, a provenance-detection framework that fingerprints an LLM by exploiting its inherent vulnerability to prompt injection, requiring no signal to be embedded in the base model before release.
  • Fingerprint prompts are optimized so that the base model exhibits consistent token preferences, yielding fingerprints that are unique to that model and robust to post-processing such as post-training and quantization (a simplified optimization sketch follows this list).
  • A unified verification procedure covers both gray-box and black-box access and comes with statistical guarantees, so the false positive rate can be controlled explicitly.
  • Evaluation on five base models and roughly 700 post-trained or quantized variants shows high true positive rates while keeping false positive rates near zero.
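
To make the "optimized fingerprint prompts" idea concrete, the following gray-box sketch greedily mutates a randomly initialized injected suffix until the base model strongly prefers a chosen target token at the next position. The greedy token-substitution search, the suffix length, and the Hugging Face helpers are assumptions for illustration; the paper's actual objective and optimizer may differ.

```python
# Hedged sketch of fingerprint-prompt optimization (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def optimize_fingerprint_prompt(model_name, base_prompt, target_token,
                                suffix_len=8, steps=50, candidates=64):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    target_id = tok.convert_tokens_to_ids(target_token)  # must be a vocab token
    vocab_size = model.get_input_embeddings().num_embeddings
    suffix_ids = torch.randint(0, vocab_size, (suffix_len,))

    def target_logprob(suffix):
        # Log-probability of the target token right after prompt + suffix.
        ids = tok(base_prompt, return_tensors="pt").input_ids
        ids = torch.cat([ids, suffix.unsqueeze(0)], dim=1)
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        return torch.log_softmax(logits, dim=-1)[target_id].item()

    best = target_logprob(suffix_ids)
    for _ in range(steps):
        pos = torch.randint(0, suffix_len, (1,)).item()
        for cand in torch.randint(0, vocab_size, (candidates,)):
            trial = suffix_ids.clone()
            trial[pos] = cand
            score = target_logprob(trial)
            if score > best:  # keep the substitution only if it helps
                best, suffix_ids = score, trial
    return base_prompt + " " + tok.decode(suffix_ids), best
```

Repeating this for many prompt and target-token pairs on the base model would yield a fingerprint set whose consistent token preferences can then be checked on a suspect model with the verification sketch above.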

💡 Why This Paper Matters

This paper addresses a practical gap in model provenance detection. Existing approaches either embed signals into the base model before release, which is infeasible for models that are already published, or compare outputs on hand-crafted or random prompts, which break down after post-training or quantization. By turning prompt injection from an attack surface into a fingerprinting mechanism, LLMPrint makes it possible to test whether a deployed model is derived from a given base model even after substantial post-processing.

🎯 Why It's Interesting for AI Security Researchers

The work is relevant to AI security researchers because it repurposes prompt injection, normally studied as an attack vector, as a tool for provenance verification. The unified gray-box and black-box verification procedure with statistical guarantees offers a concrete way to audit model reuse, license compliance, and intellectual-property claims around released LLMs, and the evaluation across roughly 700 post-trained and quantized variants indicates the fingerprints survive common post-processing pipelines.

📚 Read the Full Paper