
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Authors: Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

Published: 2025-12-02

arXiv ID: 2512.02445v1

Added to Library: 2025-12-03 03:00 UTC

Safety

📄 Abstract

Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool-calling capabilities. Prior works have focused mainly on evaluating LLMs on long-context prompts, leaving the agentic setup relatively unexplored from both capability and safety perspectives. Our work addresses this gap. We find that LLM agents can be sensitive to the length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano's increases from ~5% to ~40%, while Grok 4 Fast's decreases from ~80% to ~10% at 200K tokens. Our work shows potential safety issues with agents operating over longer contexts and raises additional questions about the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.
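
To make the measurement concrete, here is a minimal sketch of how one might probe refusal rates as padded context length grows. It is not the paper's harness: `query_model` stands in for any chat-style client, `is_refusal` is a naive keyword heuristic rather than a proper refusal judge, and the chars-per-token ratio is only an approximation.

```python
"""Minimal sketch of a refusal-rate probe across padded context lengths."""
from typing import Callable, Iterable

# Hypothetical refusal markers; real evaluations usually use an LLM- or
# rubric-based judge instead of keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to assist")


def is_refusal(response: str) -> bool:
    """Naive placeholder refusal detector."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def pad_to_length(task: str, filler: str, target_tokens: int,
                  chars_per_token: int = 4) -> str:
    """Crudely pad `task` with `filler` up to roughly `target_tokens`.

    A real harness would count tokens with the model's own tokenizer;
    the chars-per-token ratio here is only an approximation.
    """
    budget = max(0, target_tokens * chars_per_token - len(task))
    padding = (filler * (budget // max(1, len(filler)) + 1))[:budget]
    return padding + "\n\n" + task


def refusal_rate(query_model: Callable[[str], str],
                 tasks: Iterable[str],
                 filler: str,
                 target_tokens: int) -> float:
    """Fraction of tasks refused at a given padded context length."""
    tasks = list(tasks)
    refused = sum(
        is_refusal(query_model(pad_to_length(t, filler, target_tokens)))
        for t in tasks
    )
    return refused / len(tasks) if tasks else 0.0
```

Sweeping `target_tokens` over, say, 10K, 100K, and 200K for both benign and harmful task sets would trace curves of the kind the abstract reports, although the paper's actual setup is agentic, involving tool calls and multi-step tasks rather than single prompts.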

🔍 Key Points

  • The study highlights the significant performance degradation of LLM agents when the context length exceeds 100K tokens, affecting both benign and harmful tasks.
  • Unexpected shifts in refusal rates were observed, with different models exhibiting divergent behaviors under longer context, raising concerns about safety mechanisms in LLMs.
  • The type and position of context padding can critically impact model performance: coherent padding is preferable to random padding, and placing the padding before the task can mitigate degradation effects (see the sketch after this list).
  • Results indicate that a longer advertised context window does not translate into stronger capabilities, challenging claims that these models can handle extensive input effectively.
  • The findings point to potential safety risks in deploying LLM agents on multi-step tasks and call for reconsidering the metrics used to evaluate long-context models.
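
As referenced above, the following sketch illustrates the padding-type and placement conditions. It only builds the prompt variants (coherent vs. shuffled filler, placed before or after the task) and omits the agent loop and scoring; all names are illustrative rather than taken from the paper, and "random" padding is approximated here by word-shuffling the coherent text.

```python
"""Sketch of the padding-type x placement conditions described above."""
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    padding_kind: str  # "coherent" or "random"
    placement: str     # "before" or "after" (relative to the task)


def make_random_filler(coherent_filler: str, seed: int = 0) -> str:
    """Derive incoherent padding by shuffling the words of coherent text."""
    words = coherent_filler.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def build_prompt(task: str, coherent_filler: str, cond: Condition) -> str:
    """Assemble one prompt variant for a given experimental condition."""
    padding = (coherent_filler if cond.padding_kind == "coherent"
               else make_random_filler(coherent_filler))
    if cond.placement == "before":
        return padding + "\n\n" + task
    return task + "\n\n" + padding


# The four conditions: {coherent, random} padding x {before, after} placement.
CONDITIONS = [Condition(kind, place)
              for kind in ("coherent", "random")
              for place in ("before", "after")]
```

Running the same task set under each of these four conditions, at matched context lengths, is one way to separate the effect of padding coherence from the effect of padding placement.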

💡 Why This Paper Matters

This paper is crucial to understanding the limitations and safety concerns of large language models operating in agentic setups, particularly when handling extended contexts. It alerts researchers and practitioners to the need for careful evaluation of LLM performance, emphasizing that increased context length can lead to unexpected and potentially dangerous outcomes when models are faced with complex tasks.

🎯 Why It's Interesting for AI Security Researchers

The paper is important for AI security researchers as it reveals systematic vulnerabilities in LLMs that could be exploited in practical applications, particularly in situations requiring multi-step reasoning or decision-making. Understanding how context length impacts model behavior can help in designing better safety mechanisms and refining evaluation frameworks to enhance the reliability and trustworthiness of LLM agents.

📚 Read the Full Paper