
$\alpha^3$-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

Published: 2026-01-01

arXiv ID: 2601.03281v1

Added to Library: 2026-01-08 04:00 UTC

Safety

📄 Abstract

Large Language Models (LLMs) are increasingly used as high-level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol-compliant, and effective under realistic next-generation networking constraints. This paper introduces $\alpha^3$-Bench, a benchmark for evaluating LLM-driven UAV autonomy as a multi-turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language-mediated control loop between an LLM-based UAV agent and a human operator, where decisions must satisfy strict schema validity, mission policies, speaker alternation, and safety constraints while adapting to fluctuating network slices, latency, jitter, packet loss, throughput, and edge load variations. To reflect modern agentic workflows, $\alpha^3$-Bench integrates a dual action layer supporting both tool calls and agent-to-agent coordination, enabling evaluation of tool-use consistency and multi-agent interactions. We construct a large-scale corpus of 113k conversational UAV episodes grounded in UAVBench scenarios and evaluate 17 state-of-the-art LLMs using a fixed subset of 50 episodes per scenario under deterministic decoding. We propose a composite $\alpha^3$ metric that unifies six pillars: Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, and Communication Cost, with efficiency-normalized scores per second and per thousand tokens. Results show that while several models achieve high mission success and safety compliance, robustness and efficiency vary significantly under degraded 6G conditions, highlighting the need for network-aware and resource-efficient LLM-based UAV agents. The dataset is publicly available on GitHub: https://github.com/maferrag/AlphaBench
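The abstract lists schema validity and speaker alternation among the constraints each episode must satisfy. As a rough illustration only, the sketch below checks those two properties on a list of turns. The field names (`speaker`, `content`), the role labels, and the function `check_episode` are all hypothetical; the actual episode schema in the dataset may differ.

```python
from typing import Dict, List

# Hypothetical minimal schema and role labels (assumptions, not from the paper).
REQUIRED_KEYS = {"speaker", "content"}
SPEAKERS = ("operator", "uav_agent")


def check_episode(turns: List[Dict[str, str]]) -> List[str]:
    """Return human-readable violations; an empty list means the episode passes."""
    violations = []
    previous = None  # speaker of the last well-formed turn
    for i, turn in enumerate(turns):
        # Schema validity: every turn must carry the required keys.
        missing = REQUIRED_KEYS - set(turn)
        if missing:
            violations.append(f"turn {i}: missing keys {sorted(missing)}")
            previous = None
            continue
        speaker = turn["speaker"]
        if speaker not in SPEAKERS:
            violations.append(f"turn {i}: unknown speaker {speaker!r}")
        elif speaker == previous:
            # Speaker alternation: the same party may not speak twice in a row.
            violations.append(f"turn {i}: speaker {speaker!r} spoke twice in a row")
        previous = speaker
    return violations
```

A check like this would run before any scoring, so that malformed episodes are flagged separately from poor mission decisions.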

🔍 Key Points

  • Introduction of $\alpha^3$-Bench, a comprehensive and systematic benchmark for evaluating LLM-driven autonomous UAV agents under dynamic 6G network conditions.
  • Development of a dual action layer that incorporates Model Context Protocol (MCP) and Agent-to-Agent (A2A) communications for assessing complex multi-agent interactions and tool use consistency.
  • Creation of a large-scale corpus of 113,000 conversational UAV episodes based on UAVBench scenarios, from which a fixed subset of 50 episodes per scenario is used to evaluate 17 LLMs under deterministic decoding.
  • Proposal of a composite $\alpha^3$ metric that unifies six performance pillars (Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, Communication Cost) to holistically assess performance across different LLMs.
  • Evidence that robustness and efficiency vary significantly across LLMs under degraded 6G conditions, even among models with high mission success and safety compliance, underscoring the need for network-aware, resource-efficient agents in safety-critical UAV missions.
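This summary does not give the formula behind the composite $\alpha^3$ metric, so the following is only a sketch of how six pillar scores could be combined and then normalized per second and per thousand tokens. The equal weighting, the assumption that every pillar is already scaled to [0, 1] with higher meaning better, and the function names are illustrative choices, not the paper's definition.

```python
from typing import Dict

# Pillar names follow the summary; the snake_case keys are hypothetical.
PILLARS = (
    "task_outcome",
    "safety_policy",
    "tool_consistency",
    "interaction_quality",
    "network_robustness",
    "communication_cost",
)


def composite_score(scores: Dict[str, float]) -> float:
    """Equal-weight mean of the six pillar scores (weights are an assumption)."""
    for name in PILLARS:
        if name not in scores:
            raise ValueError(f"missing pillar: {name}")
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"pillar {name} must be in [0, 1]")
    return sum(scores[name] for name in PILLARS) / len(PILLARS)


def efficiency(score: float, seconds: float, tokens: int) -> Dict[str, float]:
    """Normalize a composite score by wall-clock time and by thousands of tokens."""
    if seconds <= 0 or tokens <= 0:
        raise ValueError("seconds and tokens must be positive")
    return {
        "per_second": score / seconds,
        "per_1k_tokens": score / (tokens / 1000.0),
    }
```

Under these assumptions, two models with the same composite score can still rank differently once latency and token usage are taken into account, which is the kind of separation the efficiency-normalized scores are meant to expose.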

💡 Why This Paper Matters

This paper sits at the intersection of LLM agents, autonomous systems, and 6G networking. $\alpha^3$-Bench addresses a gap in existing evaluations, which rarely test whether LLM-based UAV agents stay safe, protocol-compliant, and effective under realistic network constraints, and it scores reliability, safety, and resource efficiency together rather than in isolation. Its relevance may extend beyond UAV operations to the deployment of LLM agents in other safety-critical domains. By showing how different LLMs behave under varying network conditions, the work lays groundwork for trustworthy AI systems that adapt in real time to their operating environments.

🎯 Why It's Interesting for AI Security Researchers

For AI security researchers, this paper is particularly relevant due to its focus on ensuring the safety and robustness of autonomous UAV systems driven by LLMs. The comprehensive benchmarking framework addresses crucial aspects such as safety policy compliance, interaction quality, and network robustness, which are essential for secure and reliable AI operations. Moreover, the insights gained from the evaluations can inform security measures to mitigate risks associated with adversarial attacks or system failures in real-world deployment scenarios. As such, findings from this research can assist in advancing security protocols and standards for AI in autonomous systems operating in uncertain and dynamic environments.
