
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Authors: Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, Pan Li

Published: 2025-12-01

arXiv ID: 2512.01353v2

Added to Library: 2025-12-04 03:01 UTC

Red Teaming

📄 Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. Existing approaches overwhelmingly operate within the prompt-optimization paradigm: whether through traditional algorithmic search or recent agent-based workflows, the resulting prompts typically retain malicious semantic signals that modern guardrails are primed to detect. In contrast, we identify a deeper, largely overlooked vulnerability stemming from the highly interconnected nature of an LLM's internal knowledge. This structure allows harmful objectives to be realized by weaving together sequences of benign sub-queries, each of which individually evades detection. To exploit this loophole, we introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base. The CKA-Agent issues locally innocuous queries, uses model responses to guide exploration across multiple paths, and ultimately assembles the aggregated information to achieve the original harmful objective. Evaluated across state-of-the-art commercial LLMs (Gemini-2.5-Flash/Pro, GPT-oss-120B, Claude-Haiku-4.5), CKA-Agent consistently achieves over 95% success rates even against strong guardrails, underscoring the severity of this vulnerability and the urgent need for defenses against such knowledge-decomposition attacks. Our code is available at https://github.com/Graph-COM/CKA-Agent.

🔍 Key Points

  • Introduction of the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that enables effective jailbreak attacks on large language models (LLMs) via harmless prompt weaving and adaptive tree search.
  • CKA-Agent achieves over 95% success rates in bypassing guardrails of commercial LLMs, demonstrating a significant vulnerability in contemporary AI defenses.
  • The methodology performs a tree-structured exploration of the target LLM's knowledge base, decomposing a harmful objective into a sequence of individually innocuous sub-queries whose answers are later recombined.
  • Empirical evaluations across multiple LLMs show that existing defenses are ineffective against the adaptive, knowledge-decomposition strategies employed by the CKA-Agent.
  • The paper provides a rigorous analysis of the correlation between human and LLM judges, and highlights that current LLM defenses lack the long-range intent-detection capabilities needed to catch such attacks.

💡 Why This Paper Matters

This paper exposes critical weaknesses in the robustness of current AI safety mechanisms against jailbreaking attacks. The CKA-Agent methodology marks a significant shift in how such exploits can be conducted, moving from direct prompt manipulation to sophisticated knowledge exploration, with profound implications for the safe and ethical deployment of AI systems.

🎯 Why It's Interesting for AI Security Researchers

The findings would intrigue AI security researchers because they highlight a fundamental gap in existing alignment procedures. The success of the CKA-Agent in bypassing contemporary safeguards demonstrates urgent challenges for AI safety frameworks and motivates further research into more effective defenses against such sophisticated adversarial techniques.
