← Back to Library

ADAG: Automatically Describing Attribution Graphs

Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Published: 2026-04-08

arXiv ID: 2604.07615v1

Added to Library: 2026-04-10 02:02 UTC

πŸ“„ Abstract

In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, a fully automated, end-to-end pipeline for describing these attribution graphs. To achieve this, we introduce attribution profiles, which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer–simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
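The abstract describes attribution profiles as vectors built from a feature's input- and output-side gradient effects, which can then be compared to group features by functional role. A minimal sketch of that idea follows; the function name, the concatenate-and-normalize construction, and the toy numbers are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def attribution_profile(input_effects: np.ndarray,
                        output_effects: np.ndarray) -> np.ndarray:
    """Hypothetical attribution profile: concatenate a feature's
    input- and output-side gradient effects into one vector and
    L2-normalize it so profiles can be compared by cosine similarity."""
    profile = np.concatenate([input_effects, output_effects])
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

# Toy example: two features with similar upstream and downstream effects
# should have nearly parallel profiles, making them cluster together.
f1 = attribution_profile(np.array([0.9, 0.1]), np.array([0.0, 1.0]))
f2 = attribution_profile(np.array([0.8, 0.2]), np.array([0.1, 0.9]))
print(float(f1 @ f2))  # cosine similarity of the unit-norm profiles
```

Because the profiles are unit-norm, their dot product is the cosine similarity, which a clustering algorithm could use directly as an affinity.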

πŸ” Key Points

  • Introduction of ADAG, a fully automated, end-to-end pipeline for describing attribution graphs, replacing the ad-hoc manual interpretation of feature roles used in prior circuit-tracing work.
  • Introduction of attribution profiles, which quantify the functional role of a feature via its input and output gradient effects.
  • A novel clustering algorithm for grouping features, paired with an LLM explainer–simulator setup that generates and scores natural-language explanations of the functional role of feature groups.
  • Recovery of interpretable circuits on circuit-tracing tasks previously analysed by humans, validating the automated pipeline against known results.
  • Demonstration that ADAG can find steerable clusters responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
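The explainer–simulator setup mentioned above generates a natural-language explanation and then scores it by how well a simulator, given only that explanation, can predict the feature group's behaviour. A common way to score such simulations is the correlation between simulated and true activations; the sketch below uses that metric as an assumption, since the source does not specify the paper's exact scoring rule.

```python
import numpy as np

def simulation_score(true_acts, simulated_acts) -> float:
    """Score an explanation by the Pearson correlation between the
    feature's true activations and the activations a simulator
    predicted from the explanation alone (hypothetical metric)."""
    true_acts = np.asarray(true_acts, dtype=float)
    simulated_acts = np.asarray(simulated_acts, dtype=float)
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

# Hypothetical: a faithful explanation lets the simulator track the
# feature closely across a set of inputs, yielding a score near 1.
score = simulation_score([0.0, 0.2, 0.9, 1.0], [0.1, 0.3, 0.8, 0.9])
print(round(score, 3))
```

A low score would flag an explanation as unfaithful, so the explainer can be re-prompted or the cluster split further.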

πŸ’‘ Why This Paper Matters

This paper presents a significant advance in language model interpretability by proposing ADAG, a fully automated pipeline that describes the attribution graphs produced by circuit tracing. By replacing ad-hoc manual inspection of features with attribution profiles, automated clustering, and LLM-generated explanations that are quantitatively scored, the work makes circuit-level analysis scalable and reproducible rather than dependent on human effort for every circuit.

🎯 Why It's Interesting for AI Security Researchers

This work is particularly relevant for AI security researchers because it shows that automated circuit analysis can surface safety-relevant internal structure: ADAG identifies steerable feature clusters responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct. The ability to automatically locate and intervene on the internal components driving unsafe behaviour opens a path from interpretability findings to concrete mitigations for jailbreak vulnerabilities.

πŸ“š Read the Full Paper