OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Authors: Jarrod Barnes

Published: 2026-01-28

arXiv ID: 2601.21083v3

Added to Library: 2026-02-10 03:04 UTC

Red Teaming

📄 Abstract

As large language models (LLMs) improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning (RL) environment that evaluates IR agents under realistic prompt injection scenarios with execution-based scoring: time-to-first-containment (TTFC), evidence-gated action rate (EGAR), blast radius, and per-tier injection violation rates. Evaluating four frontier models on 40 standard-tier episodes each, we find consistent over-triggering: GPT-5.2 executes containment in 100% of episodes with 82.5% false positive rate, acting at step 4 before gathering sufficient evidence. Claude Sonnet 4.5 shows partial calibration (62.5% containment, 45% FP, TTFC of 10.6), suggesting calibration is not reliably present across frontier models. All models correctly identify the ground-truth threat when they act; the calibration gap is not in detection but in restraint. Code available at https://github.com/jbarnes850/opensec-env.

🔍 Key Points

  • Introduction of OpenSec, a dual-control reinforcement learning environment for evaluating incident response agents under adversarial conditions.
  • Implementation of execution-based scoring metrics such as Time-to-First-Containment (TTFC) and Evidence-Gated Action Rate (EGAR) to measure agent performance and calibration.
  • Revealing consistent over-triggering behavior in frontier models, with GPT-5.2 showing 100% containment but an 82.5% false positive rate, indicating potential calibration failures in action execution.
  • A spectrum of model calibration: Claude Sonnet 4.5 demonstrates partial calibration, while others, including GPT-5.2, show high rates of incorrect containment despite high detection accuracy.
  • Presentation of a systematic environment and clear metrics to better evaluate IR agent performance, going beyond traditional benchmarks.
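The metrics above can be illustrated with a minimal scoring sketch. This is not the paper's implementation; the `Episode` fields, the evidence threshold `K`, and the exact aggregation are assumptions for illustration. TTFC averages the step at which containment first fires, the false positive rate measures containment executed when no real threat was present, and EGAR approximates the fraction of containment actions preceded by sufficient evidence-gathering:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class Episode:
    contained: bool                    # did the agent execute containment?
    containment_step: Optional[int]    # step at which containment first ran
    evidence_steps: int                # evidence-gathering steps before acting
    threat_present: bool               # ground truth: was containment warranted?

def score(episodes, k=3):
    """Aggregate hypothetical OpenSec-style metrics over a batch of episodes.

    k is an assumed minimum number of evidence steps for an action to
    count as 'evidence-gated'.
    """
    acted = [e for e in episodes if e.contained]
    containment_rate = len(acted) / len(episodes)
    # TTFC: mean step of first containment, over episodes that contained
    ttfc = mean(e.containment_step for e in acted) if acted else None
    # FP rate: share of containment actions taken with no real threat
    fp_rate = mean(0.0 if e.threat_present else 1.0 for e in acted) if acted else 0.0
    # EGAR: share of containment actions preceded by >= k evidence steps
    egar = mean(1.0 if e.evidence_steps >= k else 0.0 for e in acted) if acted else None
    return containment_rate, ttfc, fp_rate, egar
```

Under this sketch, an over-triggering agent shows up as a high containment rate with a high `fp_rate` and low `ttfc`, matching the GPT-5.2 pattern the paper reports (100% containment, 82.5% FP, acting at step 4).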

💡 Why This Paper Matters

The paper advances the measurement of incident response agent calibration under adversarial scenarios, at a moment when the offensive applications of AI are improving rapidly. The OpenSec environment and its execution-based metrics enable a more nuanced understanding of agent performance than pass/fail benchmarks, which is crucial for building reliable cybersecurity systems in contemporary threat landscapes.

🎯 Why It's Interesting for AI Security Researchers

This paper will interest AI security researchers because it addresses a critical gap in evaluating AI-based incident response systems. By focusing on calibration, here meaning whether an agent withholds containment until it has gathered sufficient evidence rather than acting on adversarial or ambiguous signals, the research highlights the need for improved evaluation frameworks. The findings also carry implications for building robust AI systems that can navigate real-world security complexity, making this a valuable contribution to the field.

📚 Read the Full Paper