← Back to Library

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Authors: Zhiyuan Chang, Mingyang Li, Yuekai Huang, Ziyou Jiang, Xiaojun Jia, Qian Xiong, Junjie Wang, Zhaoyang Li, Qing Wang

Published: 2026-01-08

arXiv ID: 2601.04666v1

Added to Library: 2026-01-09 03:01 UTC

📄 Abstract

Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation

🔍 Key Points

  • Proposes a novel defense mechanism against Indirect Prompt Injection (IPI) attacks using tool result parsing to filter out malicious content before it reaches Large Language Models (LLMs).
  • Introduces two main modules: ParseData for extracting necessary data from tool results while enforcing format and logical constraints, and CheckTool for monitoring and sanitizing large text content retrieved from tools.
  • Demonstrates that the proposed defense achieves the lowest Attack Success Rate (ASR) among existing methods while maintaining competitive Utility under Attack (UA), reflected through extensive experiments on the AgentDojo benchmark.
  • Highlights the importance of avoiding computational overhead associated with existing model-based defenses, favoring a prompt-based approach that can adapt as LLMs evolve.
  • Acknowledges limitations regarding parameter hijacking in IPI, indicating the need for further research in broader defense strategies.

💡 Why This Paper Matters

This paper provides a significant advancement in the security of Large Language Models, particularly in their application in autonomous systems. By offering a robust defense against indirect prompt injection attacks, it enhances the safety and reliability of LLMs when interacting with external tools, which is crucial for the deployment of autonomous agents in real-world applications involving critical decision-making processes.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper relevant as it addresses a growing vulnerability in LLMs related to indirect prompt injection. The novel defense mechanism outlined in the work not only contributes to the discourse on securing AI systems but also sets a foundation for future research into more comprehensive security measures, particularly in safeguarding LLM integrations in various environments, including robotics and automated control systems.

📚 Read the Full Paper