
Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

Authors: Saeid Jamshidi, Negar Shahabi, Foutse Khomh, Carol Fung, Mohammad Hamdaqa

Published: 2026-04-01

arXiv ID: 2604.01127v1

Added to Library: 2026-04-02 03:00 UTC

Safety Risk & Governance

📄 Abstract

Software-Defined Networking (SDN) is increasingly adopted to secure Internet-of-Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN-IoT defense is inherently a closed-loop control problem in which mitigation actions impact controller workload, queue dynamics, rule-installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning-based approaches prioritize detection accuracy while neglecting controller coupling, and perform short-horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self-reflective two-timescale SDN-IoT defense solution separating fast mitigation from slow policy governance. At the fast timescale, per-switch Proximal Policy Optimization (PPO) agents perform controller-aware mitigation under safety constraints and action masking. At the slow timescale, a multi-agent Large Language Model (LLM) governance engine generates machine-parsable updates to the global policy constitution Pi, which encodes admissible actions, safety thresholds, and reward priorities. Updates (Delta Pi) are validated through stress testing and deployed only with non-regression and safety guarantees, ensuring an auditable evolution without retraining RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows improvements of 9.1% Macro-F1 over PPO and 15.4% over static baselines. Worst-case degradation drops by 36.8% and peak controller backlog by 42.7%, while RTT p95 inflation remains below 5.8% under high-intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.
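The fast-timescale mechanism described above, where PPO agents act only within the admissible set defined by the policy constitution, can be sketched with a simple action-masking step. This is an illustrative sketch, not the paper's implementation; the action names and logit values are assumptions for demonstration.

```python
import math

# Illustrative action set for a per-switch mitigation agent (hypothetical).
ACTIONS = ["allow", "rate_limit", "reroute", "drop"]

def masked_distribution(logits, admissible):
    """Softmax over action logits with inadmissible actions forced to
    probability zero, so the agent can never sample a forbidden action."""
    masked = [l if a in admissible else float("-inf")
              for l, a in zip(logits, ACTIONS)]
    m = max(masked)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

# Example: suppose the constitution Pi forbids outright "drop" while
# controller backlog is high; only the remaining actions get probability mass.
probs = masked_distribution(
    [0.2, 1.1, 0.5, 2.0],
    admissible={"allow", "rate_limit", "reroute"},
)
```

The key property is that masking happens before sampling, so safety constraints hold for every action the agent takes, not just in expectation.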

🔍 Key Points

  • Introduction of a self-reflective two-timescale architecture for SDN-IoT defense, separating fast decentralized mitigation from slow policy governance.
  • Utilization of per-switch Proximal Policy Optimization (PPO) agents under safety constraints for immediate mitigation actions, while a multi-agent Large Language Model (LLM) governs global policy updates.
  • Structured, auditable policy evolution through machine-parsable edits ensures safe operation and stability in a closed-loop control environment, decreasing catastrophic overload events significantly.
  • Empirical evaluations show improvements in detection performance (9.1% improvement in Macro-F1) and substantial reductions in controller backlog and worst-case degradation during high-intensity attacks.
  • The approach balances security, Quality of Service (QoS), and operational cost, demonstrating that safety and mitigation can coexist effectively in SDN-IoT networks.

💡 Why This Paper Matters

This paper presents significant advancements in the security management of SDN-IoT networks by addressing both immediate defense actions and long-term policy governance. The proposed two-timescale architecture enables more reliable and stable operation under adversarial conditions, which is critical given the increasing complexity and accessibility of IoT devices in various applications. By integrating robust reinforcement learning with careful policy evolution, the work achieves a meaningful balance between responsiveness and safety, making it a crucial reference for future research and deployment in AI-driven security solutions.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of great interest due to its innovative integration of reinforcement learning and governance mechanisms within the challenging context of SDN-IoT environments. The focus on tail-risk management rather than solely mean performance optimization addresses a vital gap in traditional security models, emphasizing resilience and operational stability in dynamic and potentially hostile network conditions. Furthermore, the structured policy evolution approach provides a unique framework that could inspire future studies on auditable and interpretable AI systems in security applications.

📚 Read the Full Paper