← Back to Library

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

Authors: Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo

Published: 2025-07-03

arXiv ID: 2507.02735v2

Added to Library: 2025-11-11 14:22 UTC

📄 Abstract

Prompt injection attack has been listed as the top-1 security threat to LLM-integrated applications, which interact with external environment data for complex tasks. The untrusted data may contain an injected prompt trying to arbitrarily manipulate the system. Model-level prompt injection defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source secure models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigating prompt injection attacks. To this end, we develop Meta SecAlign, the first fully open-source LLM with built-in model-level defense that achieves commercial-grade performance, powerful enough for complex agentic tasks. We provide complete details of our training recipe, an improved version of the SOTA SecAlign defense. We perform the most comprehensive evaluation to date on 9 utility benchmarks and 7 security benchmarks on general knowledge, instruction following, and agentic workflows. Results show that Meta SecAlign, despite being trained on generic instruction-tuning samples only, surprisingly confers security in unseen downstream tasks, including tool-calling and web-navigation, in addition to general instruction-following. Our best model -- Meta-SecAlign-70B -- establishes a new frontier of utility-security trade-off for open-source LLMs. Even compared to closed-course commercial models such as GPT-5, our model is much securer than most of them. Below are links for the code (https://github.com/facebookresearch/Meta_SecAlign), Meta-SecAlign-70B(https://huggingface.co/facebook/Meta-SecAlign-70B), and Meta-SecAlign-8B(https://huggingface.co/facebook/Meta-SecAlign-8B) models.

🔍 Key Points

  • Introduction of an activation-guided prompt injection attack framework that improves the performance of black-box attacks on LLMs.
  • Development of an Energy-based Model (EBM) that evaluates adversarial prompts based on internal activations of a surrogate model, allowing for optimized adversarial prompt generation without querying the victim model.
  • Utilization of token-level Markov Chain Monte Carlo (MCMC) sampling to effectively generate diverse adversarial prompts while maintaining naturalness and interpretability.
  • Demonstration of superior transferability with 49.6% attack success rate (ASR) across five mainstream LLMs, along with high performance on unseen task scenarios.
  • Interpretability analysis corroborates that prompt effectiveness is strongly associated with specific activation patterns, enhancing understanding of prompt injection vulnerabilities.

💡 Why This Paper Matters

This paper presents a significant advancement in the security analysis of Large Language Models (LLMs) by addressing the critical threat of direct prompt injection attacks. The proposed method improves attack success rates and robustness against various models and settings, thereby contributing to the ongoing research on LLM vulnerabilities and the necessity of effective security measures.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper particularly compelling as it tackles pressing challenges in the domain of LLM security. The novel approaches to prompt injection attacks, coupled with empirical results demonstrating effective transferability, offer valuable insights for strengthening adversarial resilience and inform the design of future defenses against such vulnerabilities.

📚 Read the Full Paper