← Back to Library

The Autonomy Tax: Defense Training Breaks LLM Agents

Authors: Shawn Li, Yue Zhao

Published: 2026-03-19

arXiv ID: 2603.19423v1

Added to Library: 2026-03-23 02:02 UTC

Red Teaming

📄 Abstract

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental \textbf{capability-alignment paradox}: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. \textbf{Agent incompetence bias} manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. \textbf{Cascade amplification bias} causes early failures to propagate through retry loops, pushing defended models to timeout on 99\% of tasks compared to 13\% for baselines. \textbf{Trigger bias} leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

🔍 Key Points

  • Introduction of the 'capability-alignment paradox' in defense training of LLMs, where improving safety compromises agent competence.
  • Identification of three systematic biases in LLM agents due to defense training: agent incompetence bias, cascade amplification bias, and trigger bias.
  • Agent incompetence bias results in high initial failure rates on benign tasks before seeing any adversarial content, while cascade amplification leads to failures propagating through retry loops, resulting in timeouts for defended models.
  • Trigger bias highlights how keyword-matching defenses can paradoxically allow attacks to bypass detection while increasing false refusals on benign inputs.
  • The paper emphasizes the necessity for a shift in defense paradigms that accommodate multi-step reasoning and maintain agent competency under adversarial conditions.

💡 Why This Paper Matters

The paper presents crucial insights into the shortcomings of current defense mechanisms for LLM agents, specifically how existing training can lead to significant operational failures. The identified biases and the concept of the 'autonomy tax' serve as critical reminders for the need to balance safety measures with the fundamental execution capabilities of AI agents. As LLMs become increasingly pivotal in sensitive applications, understanding these dynamics is essential for developing more effective and reliable systems.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers because it sheds light on the vulnerabilities inherent in LLMs as they are deployed in multi-step task environments. The findings prompt a reevaluation of existing defense strategies against prompt injection attacks and illustrate the limitations of current training paradigms. Researchers focused on improving the robustness of AI systems will find the discussions about the paradoxical outcomes and biases critical for designing future defenses that can safeguard complex AI applications while preserving their functionality.

📚 Read the Full Paper