
Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Authors: Doron Shavit

Published: 2026-02-18

arXiv ID: 2602.16520v1

Added to Library: 2026-02-19 03:00 UTC

Red Teaming

📄 Abstract

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
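The detection procedure the abstract describes (normalize and de-obfuscate, chunk with guaranteed coverage, screen chunks in parallel, aggregate evidence) can be sketched roughly as below. This is a minimal illustrative skeleton, not the paper's implementation: the function names, chunk sizes, obfuscation handling, and the keyword-based `screen_chunk` placeholder (standing in for a worker-model call) are all assumptions.

```python
import base64
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor

def normalize(text: str) -> str:
    """Undo lightweight obfuscations: Unicode confusables, zero-width
    characters, and embedded base64 runs (illustrative, not exhaustive)."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)  # zero-width chars
    def try_decode(m):
        # Decode base64-looking runs so hidden payloads reach the screeners.
        try:
            return base64.b64decode(m.group(0)).decode("utf-8")
        except Exception:
            return m.group(0)
    return re.sub(r"\b[A-Za-z0-9+/]{24,}={0,2}\b", try_decode, text)

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Overlapping chunks guarantee coverage and limit context dilution."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen_chunk(chunk_text: str) -> float:
    """Placeholder for a worker-model query returning a score in [0, 1].
    A real system would call an LLM backend here."""
    markers = ("ignore previous", "jailbreak", "no restrictions")
    return float(any(m in chunk_text.lower() for m in markers))

def detect(text: str, threshold: float = 0.5) -> dict:
    chunks = chunk(normalize(text))
    with ThreadPoolExecutor() as pool:  # parallel chunk screening
        scores = list(pool.map(screen_chunk, chunks))
    # Aggregation: a real system would also compose cross-chunk signals.
    return {"jailbreak": max(scores) >= threshold,
            "scores": scores, "n_chunks": len(chunks)}
```

The root model in an RLM would orchestrate these stages as a bounded program; here the orchestration is collapsed into one `detect` call for clarity.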

🔍 Key Points

  • Introduction of RLM-JB, an end-to-end jailbreak detection framework leveraging Recursive Language Models (RLMs) to enhance security in large language models (LLMs).
  • The framework takes a procedural approach to jailbreak detection, combining normalization, de-obfuscation, chunking, and compositional evidence aggregation to improve detection accuracy.
  • Experimental results demonstrate RLM-JB achieving high detection rates (Recall: 92.5-98.0%) and very high precision (98.99-100%) across multiple LLM backends, indicating its effectiveness in mitigating jailbreak threats.
  • The use of a multi-stage analysis pipeline allows RLM-JB to handle complex attack vectors like split-payload and long-context hiding, which traditional one-pass detection systems struggle with.
  • Thorough benchmarking across three LLM backends demonstrates consistently strong recall and precision, supporting RLM-JB's potential for real-world deployment.
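The split-payload point above can be made concrete with a toy sketch of cross-chunk composition: fragments that score as only mildly suspicious in isolation are recombined and rescreened, recovering payloads an attacker has split across chunks. The pairing strategy, thresholds, and the keyword-based `toy_screen` stand-in are illustrative assumptions, not the paper's method.

```python
from itertools import combinations

def rescreen_pairs(chunks, scores, screen_fn, low=0.2, high=0.5):
    """Rejoin borderline chunk pairs and rescreen them, so a payload
    split across chunks can still trigger a high combined score."""
    borderline = [i for i, s in enumerate(scores) if low <= s < high]
    best = max(scores) if scores else 0.0
    for i, j in combinations(borderline, 2):  # recombine split payloads
        best = max(best, screen_fn(chunks[i] + " " + chunks[j]))
    return best

def toy_screen(text):
    """Toy screener: fires fully only when both halves appear together."""
    t = text.lower()
    if "ignore previous" in t and "safety rules" in t:
        return 1.0
    if "ignore previous" in t or "safety rules" in t:
        return 0.3  # each half alone is only weakly suspicious
    return 0.0

chunks = ["ignore previous", "unrelated filler", "safety rules"]
scores = [toy_screen(c) for c in chunks]      # no chunk exceeds 0.3 alone
score = rescreen_pairs(chunks, scores, toy_screen)
```

A single-pass detector that scores chunks independently would stop at 0.3 here; the compositional rescreen is what surfaces the reassembled payload.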

💡 Why This Paper Matters

This paper is crucial for advancing the field of AI security by systematically addressing the emerging threats posed by jailbreak prompts. By presenting a novel framework that improves detection through a procedural architecture, it offers an important contribution to safeguarding LLMs in real-world applications. The findings underline the necessity of evolving detection techniques to keep pace with sophisticated adversarial strategies in AI.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper pertinent due to its focused investigation into jailbreak attacks on LLMs, a pressing issue in the deployment of AI systems. The methodologies and results presented provide a foundational understanding of procedural defenses, which could inspire further research and development of robust security measures to protect against increasingly adaptive forms of adversarial manipulation.
