
OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Authors: Jarrod Barnes

Published: 2026-01-28

arXiv ID: 2601.21083v1

Added to Library: 2026-01-30 03:00 UTC

Red Teaming

📄 Abstract

As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at https://github.com/jbarnes850/opensec-env.

🔍 Key Points

  • Introduction of OpenSec: A dual-control reinforcement learning environment designed to measure incident response agent calibration under adversarial evidence, moving beyond traditional metrics that confuse execution with correctness.
  • Execution-based metrics: OpenSec evaluates IR actions through metrics such as time-to-first-containment (TTFC), false positive rates, and injection violation rates, highlighting the gap between willingness to act and action correctness.
  • Evaluation of frontier models: The study demonstrated that leading models (GPT-5.2, Gemini 3, and DeepSeek) consistently over-trigger actions with high false positive rates (90-97%), indicating a calibration failure not captured by aggregate success metrics.
  • Partial calibration of models: Only Claude Sonnet 4.5 showed partial calibration (85% containment rate, 72% FP), suggesting calibration behavior varies widely across models and is not inherently tied to capability or training methodology.
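The three execution-based metrics above can be sketched as a small aggregation over episode logs. This is a minimal illustration, not OpenSec's actual implementation: the `Episode` fields and the exact definitions (e.g., TTFC averaged over episodes with at least one containment action, blast radius as mean false positives per episode) are assumptions for clarity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical episode record; field names are illustrative,
# not OpenSec's actual schema.
@dataclass
class Episode:
    containment_times: List[float]  # timestamps (s) of containment actions taken
    false_positives: int            # containment actions against benign targets
    injection_violations: int       # actions attributable to injected instructions

def summarize(episodes: List[Episode]) -> dict:
    """Aggregate the three execution-based metrics over a batch of episodes."""
    contained = [e for e in episodes if e.containment_times]
    # TTFC: time of the first containment action, per contained episode
    ttfc = [min(e.containment_times) for e in contained]
    return {
        "containment_rate": len(contained) / len(episodes),
        "mean_ttfc_s": sum(ttfc) / len(ttfc) if ttfc else None,
        # Blast radius: mean false positives per episode
        "blast_radius": sum(e.false_positives for e in episodes) / len(episodes),
        # Fraction of episodes with at least one injection-driven action
        "injection_violation_rate": sum(
            1 for e in episodes if e.injection_violations > 0
        ) / len(episodes),
    }
```

Under these assumed definitions, an over-triggering agent shows up as a containment rate near 1.0 paired with a high blast radius, exactly the pattern the paper reports for GPT-5.2, Gemini 3, and DeepSeek.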

💡 Why This Paper Matters

This paper is significant because it introduces OpenSec, a novel benchmarking framework for assessing incident response agents in cybersecurity. It highlights critical deficiencies in existing evaluation methods, revealing pervasive calibration failures and their operational implications. By providing a more nuanced view of agent behavior in adversarial contexts, it opens pathways for improving agent calibration, which is vital for the efficacy of AI in security operations centers.

🎯 Why It's Interesting for AI Security Researchers

These findings should interest AI security researchers: the paper not only exposes the limitations of current evaluation metrics but also offers a practical framework for training and assessing AI agents that respond to cybersecurity incidents. Its emphasis on agent calibration under adversarial conditions speaks directly to ongoing challenges in AI system safety and reliability, making it a pivotal contribution to future work in this area.

📚 Read the Full Paper