LLM as a Risk Manager: LLM Semantic Filtering for Lead-Lag Trading in Prediction Markets

Authors: Sumin Kim, Minjae Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Joo Won Lee, Oscar Levy, Alejandro Lopez-Lira, Yongjae Lee, Chanyeol Choi

Published: 2026-02-04

arXiv ID: 2602.07048v1

Added to Library: 2026-02-10 03:04 UTC

📄 Abstract

Prediction markets provide a unique setting where event-level time series are directly tied to natural-language descriptions, yet discovering robust lead-lag relationships remains challenging due to spurious statistical correlations. We propose a hybrid two-stage causal screener to address this challenge: (i) a statistical stage that uses Granger causality to identify candidate leader-follower pairs from market-implied probability time series, and (ii) an LLM-based semantic stage that re-ranks these candidates by assessing whether the proposed direction admits a plausible economic transmission mechanism based on event descriptions. Because causal ground truth is unobserved, we evaluate the ranked pairs using a fixed, signal-triggered trading protocol that maps relationship quality into realized profit and loss (PnL). On Kalshi Economics markets, our hybrid approach consistently outperforms the statistical baseline. Across rolling evaluations, the win rate increases from 51.4% to 54.5%. Crucially, the average magnitude of losing trades decreases substantially from 649 USD to 347 USD. This reduction is driven by the LLM's ability to filter out statistically fragile links that are prone to large losses, rather than relying on rare gains. These improvements remain stable across different trading configurations, indicating that the gains are not driven by specific parameter choices. Overall, the results suggest that LLMs function as semantic risk managers on top of statistical discovery, prioritizing lead-lag relationships that generalize under changing market conditions.
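The two-stage screener the abstract describes — a Granger-causality statistical screen followed by LLM-based semantic re-ranking — can be sketched as follows. This is a minimal illustration under assumed defaults (lag depth, F-statistic threshold), not the paper's implementation; `semantic_plausibility` is a hypothetical stub standing in for the LLM call that judges economic transmission mechanisms from event descriptions.

```python
import numpy as np

def granger_f_stat(leader, follower, lag=2):
    """F-statistic: do past values of `leader` improve one-step
    prediction of `follower` beyond `follower`'s own past?"""
    n = len(follower)
    y = follower[lag:]
    # Lagged regressors: follower's own past, and the candidate leader's past
    own = np.column_stack([follower[lag - i:n - i] for i in range(1, lag + 1)])
    cross = np.column_stack([leader[lag - i:n - i] for i in range(1, lag + 1)])
    ones = np.ones((n - lag, 1))
    X_r = np.hstack([ones, own])           # restricted model: intercept + own lags
    X_u = np.hstack([ones, own, cross])    # unrestricted model: + leader lags

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid

    rss_r, rss_u = rss(X_r), rss(X_u)
    dof = len(y) - X_u.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

def semantic_plausibility(leader_desc, follower_desc):
    # HYPOTHETICAL STUB: in the paper, an LLM reads the two event
    # descriptions and scores whether the proposed leader->follower
    # direction admits a plausible economic transmission mechanism.
    # A real implementation would issue an LLM API call here.
    return 0.5

def screen_pairs(candidates, f_threshold=4.0):
    """Hybrid screen: (i) keep pairs passing the statistical test,
    (ii) re-rank survivors by LLM-assessed semantic plausibility."""
    survivors = []
    for name, lead, foll, desc_l, desc_f in candidates:
        f = granger_f_stat(lead, foll)
        if f >= f_threshold:  # stage 1: statistical screen
            survivors.append((name, f, semantic_plausibility(desc_l, desc_f)))
    # stage 2: semantic re-ranking (descending plausibility)
    return sorted(survivors, key=lambda t: t[2], reverse=True)
```

The F-statistic is computed by comparing restricted and unrestricted OLS fits, which is the same comparison `statsmodels.tsa.stattools.grangercausalitytests` performs; the `f_threshold` value here is an illustrative placeholder, not a number from the paper.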

🔍 Key Points

  • Proposal of a hybrid two-stage causal screener for prediction markets: Granger-causality screening of market-implied probability time series, followed by LLM-based semantic re-ranking of candidate leader-follower pairs based on event descriptions.
  • Evaluation via a fixed, signal-triggered trading protocol that maps relationship quality into realized profit and loss (PnL), since causal ground truth is unobserved.
  • On Kalshi Economics markets, the hybrid approach raises the rolling-evaluation win rate over the statistical baseline from 51.4% to 54.5%.
  • The average magnitude of losing trades falls substantially, from 649 USD to 347 USD, driven by the LLM filtering out statistically fragile links prone to large losses rather than by chasing rare gains.
  • Improvements remain stable across different trading configurations, supporting the view of LLMs as semantic risk managers layered on top of statistical discovery.

💡 Why This Paper Matters

The paper shows that LLMs can serve as semantic risk managers layered on top of statistical signal discovery: by checking whether a statistically detected lead-lag relationship admits a plausible economic transmission mechanism, the LLM stage filters out fragile links before they translate into large trading losses. Because causal ground truth is unobservable in this setting, the authors' PnL-based evaluation protocol also offers a practical template for assessing relationship quality in prediction markets.

🎯 Why It's Interesting for AI Security Researchers

Beyond finance, the paper is a concrete case study in using an LLM as a filter that restrains an automated pipeline rather than drives it: the model's role is to veto statistically plausible but semantically implausible signals before capital is put at risk. The reported loss reduction — achieved by suppressing fragile links rather than amplifying rare gains — speaks to how LLM judgment layers can improve the robustness of automated decision systems under changing conditions, a question directly relevant to researchers studying the safe deployment of LLM-driven agents.
