Context Dependence and Reliability in Autoregressive Language Models

Authors: Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen

Published: 2026-02-01

arXiv ID: 2602.01378v1

Added to Library: 2026-02-03 08:03 UTC

📄 Abstract

Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.
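The abstract does not give RISE's exact scoring procedure, but the failure mode it targets can be illustrated with a hypothetical sketch. Below, a toy `score` function stands in for a model's support for the target output given a subset of context elements, and `similar` is an assumed near-duplicate detector; neither comes from the paper. Plain leave-one-out attribution assigns zero influence to each of two redundant elements (removing one leaves the other to carry the information), whereas a redundancy-aware variant that masks an element together with its near-duplicates recovers a nonzero score:

```python
def score(elements):
    # Toy "model": the output is supported iff the fact "X" appears
    # somewhere in the retained context.
    return 1.0 if any("X" in e for e in elements) else 0.0

def loo_attribution(elements):
    # Standard leave-one-out: score drop when a single element is removed.
    full = score(elements)
    return [full - score(elements[:i] + elements[i + 1:])
            for i in range(len(elements))]

def redundancy_aware_attribution(elements, similar):
    # Hypothetical redundancy-aware variant: mask each element together
    # with its near-duplicates, so a redundant copy left in context
    # cannot hide the element's influence.
    full = score(elements)
    scores = []
    for i in range(len(elements)):
        keep = [e for j, e in enumerate(elements)
                if j != i and not similar(elements[i], e)]
        scores.append(full - score(keep))
    return scores

ctx = ["doc1: fact X", "doc2: fact X (restated)", "doc3: unrelated"]
sim = lambda a, b: "X" in a and "X" in b

print(loo_attribution(ctx))                    # [0.0, 0.0, 0.0] — copies mask each other
print(redundancy_aware_attribution(ctx, sim))  # [1.0, 1.0, 0.0]
```

This only demonstrates the attribution-instability problem the abstract describes; RISE's actual conditional-information scoring is presumably more principled than the duplicate-masking heuristic used here.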

🔍 Key Points

  • Introduction of RISE (Redundancy-Insensitive Scoring of Explanation), which quantifies the unique influence of each context element on the model's output relative to the rest of the context.
  • Diagnosis of why standard attribution methods fail on redundant, overlapping context drawn from prompts, retrieved passages, and interaction history: minor input changes produce unpredictable shifts in attribution scores.
  • Use of conditional information to separate essential context elements from merely correlated ones, so redundant copies do not dilute or destabilize attributions.
  • Empirical evidence that RISE produces more robust and stable explanations than traditional attribution methods.
  • Practical relevance for trustworthy LLM explanation and monitoring, including risks such as prompt injection.

💡 Why This Paper Matters

The paper addresses a core obstacle to interpreting LLMs in critical applications: standard attribution methods become unstable when context contains redundant or overlapping information, so small input changes can flip which elements appear influential. By scoring each context element's unique, conditional influence, RISE yields stable attributions, a prerequisite for trustworthy explanation and monitoring of deployed LLMs.

🎯 Why It's Interesting for AI Security Researchers

Reliable context attribution underpins several security tasks, most directly the detection of prompt injection, where knowing which context elements actually drove an output is essential. By making attributions robust to redundancy, RISE offers security researchers a more dependable signal for monitoring LLM behavior and a benchmark for future work on stable explanations in adversarial settings.
