
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Authors: Doron Shavit

Published: 2026-02-18

arXiv ID: 2602.16520v1

Added to Library: 2026-02-19 03:00 UTC

Red Teaming

📄 Abstract

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
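The detection procedure the abstract describes (normalize and de-obfuscate, chunk with guaranteed coverage, screen chunks in parallel, aggregate evidence) can be sketched roughly as below. This is a minimal illustrative skeleton, not the paper's implementation: the function names, chunk sizes, obfuscation handling, and the keyword-based `screen_chunk` placeholder (standing in for a worker-model call) are all assumptions.

```python
import base64
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor

def normalize(text: str) -> str:
    """Undo lightweight obfuscations: Unicode confusables, zero-width
    characters, and embedded base64 runs (illustrative, not exhaustive)."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)  # zero-width chars
    def try_decode(m):
        # Decode base64-looking runs so hidden payloads reach the screeners.
        try:
            return base64.b64decode(m.group(0)).decode("utf-8")
        except Exception:
            return m.group(0)
    return re.sub(r"\b[A-Za-z0-9+/]{24,}={0,2}\b", try_decode, text)

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Overlapping chunks guarantee coverage and limit context dilution."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen_chunk(chunk_text: str) -> float:
    """Placeholder for a worker-model query returning a score in [0, 1].
    A real system would call an LLM backend here."""
    markers = ("ignore previous", "jailbreak", "no restrictions")
    return float(any(m in chunk_text.lower() for m in markers))

def detect(text: str, threshold: float = 0.5) -> dict:
    chunks = chunk(normalize(text))
    with ThreadPoolExecutor() as pool:  # parallel chunk screening
        scores = list(pool.map(screen_chunk, chunks))
    # Aggregation: a real system would also compose cross-chunk signals.
    return {"jailbreak": max(scores) >= threshold,
            "scores": scores, "n_chunks": len(chunks)}
```

The root model in an RLM would orchestrate these stages as a bounded program; here the orchestration is collapsed into one `detect` call for clarity.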

🔍 Key Points

  • Introduction of RLM-JB, an end-to-end jailbreak detection framework leveraging Recursive Language Models (RLMs) to enhance security in large language models (LLMs).
  • The framework takes a procedural approach to jailbreak detection, combining normalization, de-obfuscation, chunking, and compositional evidence aggregation to improve detection accuracy.
  • Experimental results demonstrate RLM-JB achieving high detection rates (Recall: 92.5-98.0%) and very high precision (98.99-100%) across multiple LLM backends, indicating its effectiveness in mitigating jailbreak threats.
  • The use of a multi-stage analysis pipeline allows RLM-JB to handle complex attack vectors like split-payload and long-context hiding, which traditional one-pass detection systems struggle with.
  • Thorough benchmarking across three LLM backends demonstrates consistently strong recall and precision, supporting RLM-JB's potential for real-world deployment.
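The split-payload point above can be made concrete with a toy sketch of cross-chunk composition: fragments that score as only mildly suspicious in isolation are recombined and rescreened, recovering payloads an attacker has split across chunks. The pairing strategy, thresholds, and the keyword-based `toy_screen` stand-in are illustrative assumptions, not the paper's method.

```python
from itertools import combinations

def rescreen_pairs(chunks, scores, screen_fn, low=0.2, high=0.5):
    """Rejoin borderline chunk pairs and rescreen them, so a payload
    split across chunks can still trigger a high combined score."""
    borderline = [i for i, s in enumerate(scores) if low <= s < high]
    best = max(scores) if scores else 0.0
    for i, j in combinations(borderline, 2):  # recombine split payloads
        best = max(best, screen_fn(chunks[i] + " " + chunks[j]))
    return best

def toy_screen(text):
    """Toy screener: fires fully only when both halves appear together."""
    t = text.lower()
    if "ignore previous" in t and "safety rules" in t:
        return 1.0
    if "ignore previous" in t or "safety rules" in t:
        return 0.3  # each half alone is only weakly suspicious
    return 0.0

chunks = ["ignore previous", "unrelated filler", "safety rules"]
scores = [toy_screen(c) for c in chunks]      # no chunk exceeds 0.3 alone
score = rescreen_pairs(chunks, scores, toy_screen)
```

A single-pass detector that scores chunks independently would stop at 0.3 here; the compositional rescreen is what surfaces the reassembled payload.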

💡 Why This Paper Matters

This paper is crucial for advancing the field of AI security by systematically addressing the emerging threats posed by jailbreak prompts. By presenting a novel framework that improves detection through a procedural architecture, it offers an important contribution to safeguarding LLMs in real-world applications. The findings underline the necessity of evolving detection techniques to keep pace with sophisticated adversarial strategies in AI.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper pertinent due to its focused investigation into jailbreak attacks on LLMs, a pressing issue in the deployment of AI systems. The methodologies and results presented provide a foundational understanding of procedural defenses, which could inspire further research and development of robust security measures to protect against increasingly adaptive forms of adversarial manipulation.
