Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Authors: Kerem Zaman, Shashank Srivastava

Published: 2025-12-28

arXiv ID: 2512.23032v1

Added to Library: 2026-01-07 10:07 UTC

📄 Abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
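To make the critiqued setup concrete, here is a minimal Python sketch of a Biasing-Features-style check as described in the abstract: a CoT is flagged unfaithful if a prompt-injected hint changed the prediction but is never verbalized. The `generate` callable, the function name, and the substring-based verbalization check are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a Biasing-Features-style check (not the authors' code).
# Assumptions: `generate(prompt)` returns (cot_text, final_answer) for some model,
# and hint "verbalization" is approximated by a simple substring match.

def biasing_features_flag(question: str, hint: str, generate) -> bool:
    """Flag a CoT as 'unfaithful' under the hint-based criterion:
    the injected hint changed the prediction, but the CoT never mentions it."""
    _, answer_clean = generate(question)                         # prediction without the hint
    cot_hinted, answer_hinted = generate(f"{hint}\n{question}")  # prediction with injected hint

    hint_affected_prediction = answer_hinted != answer_clean
    hint_verbalized = hint.lower() in cot_hinted.lower()

    # The paper argues this label conflates unfaithfulness with incompleteness:
    # the hint may influence the answer through the CoT without being verbalized.
    return hint_affected_prediction and not hint_verbalized
```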

🔍 Key Points

  • Argues that the Biasing Features metric conflates unfaithfulness with incompleteness, the lossy compression required to turn distributed transformer computation into a linear natural-language narrative.
  • Shows on multi-hop reasoning tasks with Llama-3 and Gemma-3 that many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models.
  • Introduces the faithful@k metric and shows that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness stems from tight token limits (see the sketch after this list).
  • Uses Causal Mediation Analysis to show that even non-verbalized hints can causally mediate prediction changes through the CoT.
  • Cautions against relying solely on hint-based evaluations and advocates a broader interpretability toolkit, including causal mediation and corruption-based metrics.
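A rough sketch of how a faithful@k-style measurement could be run follows. The abstract does not spell out the metric's exact definition, so the sampling-over-k-CoTs interpretation, the `generate_with_budget` interface, and the substring check are assumptions for illustration only.

```python
# Hedged sketch of a faithful@k-style measurement: does the model verbalize the
# injected hint in at least one of k sampled CoTs under a given token budget?
# The precise definition in the paper may differ; this is only an illustration.

def faithful_at_k(question: str, hint: str, generate_with_budget,
                  k: int = 8, max_new_tokens: int = 1024) -> bool:
    prompt = f"{hint}\n{question}"
    for _ in range(k):
        cot, _ = generate_with_budget(prompt, max_new_tokens=max_new_tokens)
        if hint.lower() in cot.lower():
            return True   # hint verbalized in at least one sampled CoT
    return False
```

Sweeping `max_new_tokens` in this setup is one way to probe the abstract's claim that larger inference-time budgets substantially increase hint verbalization.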

💡 Why This Paper Matters

By separating unfaithfulness from incompleteness, the paper challenges a widely used hint-based test of chain-of-thought faithfulness. Its results suggest that many CoTs labeled unfaithful are better explained by tight token budgets or by non-verbalized but causally active reasoning, which argues for evaluating CoT explanations with a broader toolkit that includes causal mediation and corruption-based metrics.

🎯 Why It's Interesting for AI Security Researchers

Chain-of-thought faithfulness underpins efforts to monitor and audit model reasoning. This work shows that a standard hint-based test, built around prompt-injected hints, can mislabel CoTs as unfaithful when the real issue is incompleteness or a limited token budget, and that hints can influence predictions through the CoT even without being verbalized. For researchers who inspect CoTs to detect hidden influences such as injected instructions, the proposed faithful@k metric and causal mediation analyses offer more reliable evaluation tools.

📚 Read the Full Paper