Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

Authors: J Alex Corll

Published: 2026-02-11

arXiv ID: 2602.11247v1

Added to Library: 2026-02-13 03:01 UTC

Red Teaming

📄 Abstract

Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer -- without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at rho ~ 0.4, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
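The flaw the abstract identifies in weighted averaging is easy to reproduce: an average of identical per-turn scores equals the per-turn score no matter how many turns there are, so the aggregate carries no signal about persistence. A minimal illustration (the scores are invented for the example; this is not the paper's code):

```python
def weighted_average(scores, weights=None):
    """Aggregate per-turn risk scores by weighted average."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# One suspicious turn scoring 0.6:
single = weighted_average([0.6])
# Twenty identical suspicious turns -- same aggregate (up to float rounding):
persistent = weighted_average([0.6] * 20)
```

A 20-turn persistent attack and a single suspicious turn are indistinguishable under this aggregation, which is exactly the failure mode that motivates peak + accumulation scoring.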

🔍 Key Points

  • Identifies a critical flaw in the weighted-average approach to multi-turn scoring: the aggregate converges to the per-turn score regardless of turn count, so a persistent attack is scored no higher than a single suspicious turn.
  • Proposes the peak + accumulation scoring formula that combines peak risk, persistence ratio, and category diversity, providing a more effective way to assess multi-turn conversation risks.
  • Achieves 90.8% recall at a 1.20% false positive rate (F1 85.9%) on a dataset of 10,654 multi-turn conversations, demonstrating effective detection of multi-turn prompt injection attacks.
  • Presents a sensitivity analysis revealing a performance phase transition at persistence parameter rho ~ 0.4, where recall jumps 12 percentage points with negligible false-positive cost, underscoring the importance of parameter tuning.
  • Releases the scoring algorithm, regex pattern library, and evaluation framework as open source, promoting transparency and further research in LLM attack detection.

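As a concrete reading of the key points, a peak + accumulation score combines the maximum per-turn score, a persistence term driven by the fraction of turns at or above a threshold rho, and a category-diversity term. The summary does not give the paper's exact formula, so the weights, the linear combination, and the diversity saturation below are illustrative assumptions only, not the published method:

```python
def peak_accumulation_score(turn_scores, turn_categories, rho=0.4,
                            w_peak=0.6, w_persist=0.25, w_diversity=0.15):
    """Illustrative peak + accumulation aggregate (weights are invented).

    turn_scores: per-turn pattern scores in [0, 1].
    turn_categories: pattern category hit per turn (None for a clean turn).
    rho: persistence threshold (the paper reports a phase transition near 0.4).
    """
    peak = max(turn_scores)
    # Persistence ratio: fraction of turns at or above the threshold rho.
    persistence = sum(s >= rho for s in turn_scores) / len(turn_scores)
    # Category diversity, saturating once three distinct categories appear.
    categories = {c for c in turn_categories if c is not None}
    diversity = min(len(categories) / 3, 1.0)
    score = w_peak * peak + w_persist * persistence + w_diversity * diversity
    return min(score, 1.0)

# One suspicious turn in a 20-turn chat vs. 15 suspicious turns of 20:
one_hit = peak_accumulation_score(
    [0.6] + [0.05] * 19, ["jailbreak"] + [None] * 19)
persistent = peak_accumulation_score(
    [0.6] * 15 + [0.05] * 5, ["jailbreak"] * 15 + [None] * 5)
```

Unlike the weighted average, the persistent attack scores strictly higher here because the persistence ratio rewards repeated above-threshold turns even when the peak score is unchanged.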
💡 Why This Paper Matters

This paper presents a necessary advancement in detecting multi-turn prompt injection attacks against LLMs. By introducing the peak + accumulation scoring formula, the authors address a concrete limitation of existing per-turn methods, enabling reliable, proxy-level detection of attacks that span multiple conversational turns without invoking an LLM. Given the critical need for robust AI security mechanisms, these contributions are directly relevant to improving safety in deployed AI applications.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers because of its direct implications for securing LLMs against distributed, multi-turn attacks. As AI systems are increasingly deployed in sensitive, high-stakes environments, understanding and mitigating prompt injection risk becomes paramount. The proposed scoring method offers a measurable improvement over per-turn aggregation, making it a useful reference for researchers building defenses against attack patterns that unfold across many turns.
