Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

Authors: J Alex Corll

Published: 2026-02-11

arXiv ID: 2602.11247v2

Added to Library: 2026-03-09 02:02 UTC

Red Teaming

📄 Abstract

Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer -- without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at rho ~ 0.4, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
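The abstract names the three ingredients — peak single-turn risk, a persistence ratio, and category diversity — but not the exact combination. The sketch below shows one plausible shape; the combination weights (`w_persist`, `w_diversity`), the saturation constant `k_saturate`, the diversity normalization, and the function name are all my assumptions, not the paper's formula:

```python
def peak_accumulation_score(turn_scores, turn_categories, rho=0.4,
                            w_persist=0.3, w_diversity=0.2, k_saturate=5):
    """Aggregate per-turn pattern scores into a conversation-level risk score.

    turn_scores:     per-turn risk values in [0, 1] from a single-turn matcher
    turn_categories: attack-pattern category hit on each turn (None = no hit)
    rho:             persistence threshold; the paper reports a phase
                     transition around rho ~ 0.4
    NOTE: weights and saturation constant are illustrative assumptions.
    """
    if not turn_scores:
        return 0.0
    # Peak: the single most suspicious turn sets the floor of the score.
    peak = max(turn_scores)
    # Persistence: count of turns at or above rho, saturating at k_saturate,
    # so a 20-turn campaign accumulates risk that a lone spike does not.
    persistence = min(1.0, sum(s >= rho for s in turn_scores) / k_saturate)
    # Category diversity: distinct attack categories touched (assumed
    # normalization: saturate after three distinct categories).
    cats = {c for c in turn_categories if c is not None}
    diversity = min(1.0, len(cats) / 3)
    return min(1.0, peak + w_persist * persistence + w_diversity * diversity)
```

Under this sketch a sustained 20-turn attack at per-turn score 0.6 outranks a single 0.6-score turn, which is exactly the separation the weighted average cannot provide.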

🔍 Key Points

  • Introduces the peak + accumulation scoring formula for aggregating per-turn pattern scores into a conversation-level risk score, filling a gap in existing proxy-layer detection methods.
  • Shows why weighted-average aggregation fails for multi-turn detection: it converges to the per-turn score regardless of turn count, so it cannot capture attack persistence.
  • Operates entirely within proxy-level constraints: scoring is deterministic and fast, requiring no LLM calls or heavy computation.
  • Empirical evaluation on more than 10,000 conversations yields 90.8% recall at a 1.20% false positive rate (F1 of 85.9%).
  • Releases the scoring algorithm, pattern library, and evaluation harness as open source, enabling replication and further research.

💡 Why This Paper Matters

This paper addresses a practical gap in the detection of multi-turn prompt injection attacks against language models: how to aggregate per-turn pattern scores into a conversation-level risk score without invoking an LLM. By combining peak single-turn risk, persistence, and category diversity, the proposed formula captures attacks that distribute malicious intent across many turns while keeping false positives low. The rigorous evaluation demonstrates strong performance, and the open-source release of the scoring algorithm, pattern library, and evaluation harness makes the method directly usable in real deployments, contributing both theoretical grounding and practical tooling for AI safety.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is directly relevant: multi-turn prompt injection exploits the common assumption that each conversation turn is evaluated independently, and defending against it is critical for robust deployment of AI systems. The proposed scoring formula offers a deterministic detection mechanism that does not rely on an LLM, so it can run at the proxy layer and integrate cleanly into existing security architectures. The open-source release of the scoring algorithm, pattern library, and evaluation harness allows researchers to reproduce the results and extend the work to enhance safety features in AI applications.
