Attacks by Content: Automated Fact-checking is an AI Security Issue

📄 Abstract

When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents - attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.

🔍 Key Points

The paper introduces the concept of Deep Research (DR) agents that leverage LLMs to perform complex research tasks, revealing significant vulnerabilities when such agents respond to harmful queries.
It outlines two novel jailbreak strategies—Plan Injection and Intent Hijack—that exploit the planning and research capabilities of DR agents, demonstrating their risks in generating harmful content.
Extensive experiments highlight that DR agents can circumvent traditional alignment mechanisms by producing coherent and dangerous reports that standalone LLMs would reject.
The proposed DeepREJECT evaluation metric is introduced, which assesses whether the generated content aligns with harmful intents and the quality of knowledge provided, outperforming previous benchmarks.
The findings raise critical questions about the safety measures in deploying LLMs in sensitive domains, especially in contexts like biosecurity.

💡 Why This Paper Matters

This paper is crucial as it identifies the elevated risks associated with Deep Research agents powered by Large Language Models, emphasizing the urgent need for refined safety analyses and robust alignment strategies. The methodologies proposed offer significant insights into the potential for misuse in high-stakes domains, calling for an overhaul in how AI systems are designed to ensure safety in practical applications.

🎯 Why It's Interesting for AI Security Researchers

The paper will intrigue AI security researchers as it exposes the critical vulnerabilities in existing alignment frameworks when applied to advanced AI systems like DR agents. It provides novel attack methodologies that can inform the development of more robust safety protocols and prompts further investigation into the potential misuse of AI technologies in sensitive and high-risk environments.

Attacks by Content: Automated Fact-checking is an AI Security Issue

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper