Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Authors: Haochuan Kevin Wang

Published: 2026-03-30

arXiv ID: 2603.28013v2

Added to Library: 2026-04-06 02:06 UTC

Red Teaming

📄 Abstract

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents -- 0/40 canaries survived into shared memory.
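
The canary mechanics described above are simple to sketch. The following Python is a minimal illustration of the methodology as stated in the abstract, not the authors' actual harness: the names make_canary and stage_hits and the artifacts mapping are assumptions for exposition, while the canary format SECRET-[A-F0-9]{8} and the four stage names come from the paper.

```python
import re
import secrets

# Canary format from the paper: SECRET- followed by 8 uppercase hex digits.
CANARY_RE = re.compile(r"SECRET-[A-F0-9]{8}")

# Kill-chain stages in pipeline order, as named in the abstract.
STAGES = ("Exposed", "Persisted", "Relayed", "Executed")

def make_canary() -> str:
    """Mint a fresh canary token for one attacked run."""
    token = f"SECRET-{secrets.token_hex(4).upper()}"
    assert CANARY_RE.fullmatch(token)
    return token

def stage_hits(canary: str, artifacts: dict[str, str]) -> dict[str, bool]:
    """Check each stage's captured artifact for this run's canary.

    `artifacts` maps a stage name to the text observable at that stage,
    e.g. the retrieved context (Exposed), the written memory entry
    (Persisted), the inter-agent message (Relayed), and the final tool
    call or answer (Executed).
    """
    return {stage: canary in artifacts.get(stage, "") for stage in STAGES}
```

Minting a fresh token per run lets a hit at any stage be attributed to a specific attacked run rather than to cross-run contamination.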

🔍 Key Points

  • The paper introduces a stage-level evaluation framework built on a 'kill-chain canary' methodology that attributes defense effectiveness to specific pipeline stages rather than only the final outcome, sharpening the picture of where prompt injection defenses actually act (see the attribution sketch after this list).
  • The authors demonstrate that a model's safety against prompt injection is determined primarily by its behavior at downstream pipeline stages: exposure to injections is universal (100% for all five models), so safety hinges on how each model handles an injection after it has been seen.
  • Claude's defense eliminates injections at the write_memory summarization stage (0% ASR), while models like GPT-4o-mini propagate canaries intact (53% ASR), a stark contrast in where safety arises within the pipeline architecture.
  • Surface mismatch is identified as a crucial structural flaw: defenses fail completely (all four active defense conditions reach 100% ASR) when they were not designed for the specific attack surface being exercised, challenging existing assumptions about how defense effectiveness is trained and evaluated.
  • The study calls for treating write-node identity as a fundamental architectural choice in multi-agent systems, since a decontaminating write node ensures that hazardous prompts cannot propagate downstream (0/40 canaries survived past a Claude relay node).
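
To make the attribution logic of the first point concrete, a run's defense can be localized to the first stage at which the canary disappears. Again a hedged sketch: defense_stage and the hits dictionary are illustrative names, and only the stage ordering is taken from the abstract.

```python
# Kill-chain stages in pipeline order (Exposed -> Persisted -> Relayed -> Executed).
STAGES = ("Exposed", "Persisted", "Relayed", "Executed")

def defense_stage(hits: dict[str, bool]) -> str | None:
    """Return the first stage at which the canary was dropped.

    None means the canary survived through execution, i.e. the attack
    succeeded and the run counts toward ASR.
    """
    for stage in STAGES:
        if not hits[stage]:
            return stage
    return None

# A run matching the abstract's Claude finding: the injection is seen
# (Exposed) but stripped during write_memory summarization, so it never
# reaches the Persisted, Relayed, or Executed stages.
hits = {"Exposed": True, "Persisted": False, "Relayed": False, "Executed": False}
assert defense_stage(hits) == "Persisted"
```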

💡 Why This Paper Matters

This paper is crucial in advancing the field of AI security, particularly in the context of prompt injection attacks against large language models. By shifting the focus from model capabilities to the architectural design of agent pipelines, it offers novel insights into how adversarial content can be effectively managed or mitigated. The findings underscore the importance of evaluating defenses across diverse attack surfaces and combinations of models rather than relying on a single task-level outcome, strengthening security assessments in practical applications.

🎯 Why It's Interesting for AI Security Researchers

This paper represents significant progress in understanding and addressing the vulnerabilities of AI systems to prompt injection attacks, which grow increasingly relevant as LLMs are integrated into real-world applications. AI security researchers will find the stage-level methodology, the practical implications for defense design, and the analysis of pipeline architecture vital to developing robust countermeasures against adversarial manipulation of AI agents.

📚 Read the Full Paper