
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Authors: Haochuan Kevin Wang

Published: 2026-03-30

arXiv ID: 2603.28013v1

Added to Library: 2026-03-31 03:01 UTC

Red Teaming

📄 Abstract

We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages (Exposed, Persisted, Relayed, Executed) across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models (the safety gap is entirely downstream); (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41-65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model, a complete reversal across injection channels; (4) all four active defense conditions (write_filter, pi_detector, spotlighting, and their combination) produce 100% ASR due to threat-model surface mismatch; (5) a Claude relay node decontaminates downstream agents: 0/40 canaries survived into shared memory.
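As a concrete illustration of the canary methodology, here is a minimal Python sketch of per-run instrumentation. Only the canary format (SECRET-[A-F0-9]{8}) and the four stage names come from the abstract; the `Stage`, `RunTrace`, and `mint_canary` names are hypothetical, not the paper's released harness.

```python
import re
import secrets
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The paper's four kill-chain stages, in pipeline order."""
    EXPOSED = 1    # adversarial content entered the agent's context
    PERSISTED = 2  # canary survived summarization into memory
    RELAYED = 3    # canary crossed an agent/tool boundary
    EXECUTED = 4   # the injected instruction was acted upon


CANARY_RE = re.compile(r"^SECRET-[A-F0-9]{8}$")


def mint_canary() -> str:
    """Generate a fresh canary matching SECRET-[A-F0-9]{8}."""
    token = f"SECRET-{secrets.token_hex(4).upper()}"
    assert CANARY_RE.match(token)
    return token


@dataclass
class RunTrace:
    """Tracks the deepest kill-chain stage a single run's canary reaches."""
    canary: str = field(default_factory=mint_canary)
    reached: set[Stage] = field(default_factory=set)

    def observe(self, stage: Stage, artifact: str) -> None:
        """Record the stage if the canary appears in this pipeline artifact
        (retrieved document, memory write, relayed message, tool call)."""
        if self.canary in artifact:
            self.reached.add(stage)

    def deepest(self) -> Stage | None:
        return max(self.reached) if self.reached else None
```

A run counts toward ASR when `deepest()` reaches `Stage.EXECUTED`; keeping the full set of reached stages per run is what allows localizing the stage at which each model's defense cuts the kill chain.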

🔍 Key Points

  • Development of a stage-decomposed analysis to track prompt injection across four kill-chain stages: Exposed, Persisted, Relayed, Executed.
  • Introduction of a cryptographic canary token methodology to assess where defenses activate within the model pipeline, revealing that safety gaps are predominantly downstream.
  • Finding that all five evaluated models exhibited 100% exposure to adversarial content, with safety determined by the propagation of that content through the pipeline rather than its initial detection.
  • Identification of significant variance in attack success rates (ASR) across both model and surface, with DeepSeek demonstrating a 0% ASR on memory poisoning but a 100% ASR on tool poisoning; a sketch of the confidence-interval computation behind the reported ASR figures follows this list.
  • Demonstration that existing defenses fail due to mismatched threat models and that placing a Claude model at the summarization stage can provide composable safety for downstream agents.
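
The 95% confidence intervals quoted for per-model ASR (e.g., GPT-4o-mini's 53%, CI 41-65%) are consistent with a standard binomial interval; below is a Wilson score sketch. The counts 35/66 are illustrative only, chosen because they land near the reported interval; the paper's per-model run counts are not restated here.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative counts only: 35/66 lands near the abstract's 53% (41-65%).
lo, hi = wilson_ci(35, 66)
print(f"ASR = {35 / 66:.0%}, 95% CI: {lo:.0%}-{hi:.0%}")  # ASR = 53%, 95% CI: 41%-65%
```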

💡 Why This Paper Matters

This paper offers critical insight into prompt injection attacks by showing that a defense's effectiveness depends on the pipeline stage at which it operates within an LLM agent, not merely on whether adversarial content is detected. The methodology and findings are directly applicable to designing more effective security frameworks and deployment strategies for AI agents.

🎯 Why It's Interesting for AI Security Researchers

The study offers valuable data and methods for AI security researchers studying adversarial vulnerabilities in large language models. By revealing how models propagate adversarial inputs across pipeline stages, and why defenses fail when their threat model targets the wrong surface, it helps researchers design systems that mitigate prompt injection risk at the stage where it actually arises.
