Highlight & Summarize: RAG without the jailbreaks

Authors: Giovanni Cherubin, Andrew Paverd

Published: 2025-08-04

arXiv ID: 2508.02872v1

Added to Library: 2025-08-14 23:05 UTC

📄 Abstract

Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM's system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts relevant passages ("highlights") from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.
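To make the design pattern concrete, here is a minimal Python sketch of the two-stage pipeline described in the abstract. The class and attribute names (HSPipeline, retrieve, highlight, summarize) are illustrative assumptions, not the authors' implementation or API.

```python
# Minimal sketch of the Highlight & Summarize (H&S) design pattern described
# in the abstract. All names here are hypothetical illustrations, not the
# paper's implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HSPipeline:
    retrieve: Callable[[str], List[str]]               # question -> retrieved documents
    highlight: Callable[[str, List[str]], List[str]]   # question + documents -> highlighted passages
    summarize: Callable[[List[str]], str]              # passages only -> cohesive answer

    def answer(self, question: str) -> str:
        docs = self.retrieve(question)
        # The highlighter sees the user's question, but only extracts
        # passages from the retrieved documents.
        highlights = self.highlight(question, docs)
        # The summarizer never sees the user's question, so a crafted
        # prompt cannot reach the generative LLM directly.
        return self.summarize(highlights)
```

The security property is visible in the types: summarize receives only the highlighted passages, never the user's question.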

🔍 Key Points

  • Introduces Highlight & Summarize (H&S), a design pattern for retrieval-augmented generation (RAG) systems that prevents jailbreaking and model hijacking by design, rather than relying on probabilistic defenses such as hardened system prompts or content classifiers.
  • The core idea is to perform the same task as a standard RAG pipeline without ever revealing the user's question to the generative LLM, removing the attacker's direct channel to the model that produces the final response.
  • The pipeline is split into two components: a highlighter, which takes the user's question and extracts relevant passages ("highlights") from the retrieved documents, and a summarizer, which turns the highlighted passages into a cohesive answer (see the instantiation sketch after this list).
  • Several possible instantiations of H&S are described, and their generated responses are evaluated in terms of correctness, relevance, and response quality.
  • Surprisingly, when an LLM-based highlighter is used, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.
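The abstract's best-performing variant uses an LLM-based highlighter. The sketch below shows one plausible way to instantiate the highlighter and summarizer around a generic llm(prompt) callable; the prompts, the verbatim-passage filter, and all names are assumptions for illustration, not the prompts or checks used in the paper.

```python
# Hypothetical LLM-based instantiation of the highlighter and summarizer,
# compatible with the HSPipeline sketch above. Prompts and filtering are
# illustrative assumptions, not the authors' design.

from typing import Callable, List


def make_llm_highlighter(llm: Callable[[str], str]) -> Callable[[str, List[str]], List[str]]:
    def highlight(question: str, docs: List[str]) -> List[str]:
        prompt = (
            "Copy, verbatim, the sentences from the documents below that are "
            "relevant to the question.\n\n"
            f"Question: {question}\n\nDocuments:\n" + "\n---\n".join(docs)
        )
        candidates = [line.strip() for line in llm(prompt).splitlines()]
        # Keep only passages that literally occur in the retrieved documents,
        # so the highlighter cannot pass free-form text to the summarizer.
        return [c for c in candidates if c and any(c in d for d in docs)]
    return highlight


def make_llm_summarizer(llm: Callable[[str], str]) -> Callable[[List[str]], str]:
    def summarize(highlights: List[str]) -> str:
        # Note: the user's question is deliberately absent from this prompt.
        prompt = (
            "Summarize the following passages into a single cohesive answer:\n"
            + "\n".join(highlights)
        )
        return llm(prompt)
    return summarize
```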

💡 Why This Paper Matters

This paper presents a significant advance in AI security, addressing the vulnerability of LLM-based RAG systems to jailbreaking and model hijacking. Existing mitigations, such as hardening the system prompt or deploying content classifiers, are probabilistic and relatively easy to bypass given the very large space of possible inputs and undesirable outputs. H&S instead prevents these attacks by construction: the generative LLM never sees the user's question, so there is no direct channel through which a crafted prompt can hijack it. The evaluation further suggests that this security benefit need not come at the cost of quality, since with an LLM-based highlighter the majority of H&S responses are judged better than those of a standard RAG pipeline.

🎯 Why It's Interesting for AI Security Researchers

This paper will be of particular interest to AI security researchers because it replaces detection-based mitigations with a structural, by-design defense for RAG systems. With the growing prevalence of jailbreaking and model-hijacking attacks on deployed chatbots, a design pattern that withholds the user's question from the generative LLM is a notable shift from hardening prompts or classifying content. The comparison of several H&S instantiations against a standard RAG pipeline, in terms of correctness, relevance, and response quality, also makes it a practical reference point for anyone building or securing RAG-based applications.

📚 Read the Full Paper