
Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Authors: Shaked Zychlinski, Yuval Kainan

Published: 2025-10-30

arXiv ID: 2510.26847v1

Added to Library: 2025-11-03 05:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that mitigates these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of language-model tokenization: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods, which rely on added modules such as dedicated LLMs or perplexity models. We validate our approach across a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.

🔍 Key Points

  • Introduction of CPT-Filtering, a model-agnostic technique for identifying encoded and ciphered prompts to enhance LLM security.
  • CPT-Filtering exploits the statistical behavior of Byte-Pair Encoding (BPE) tokenizers: obfuscated inputs are split into more, shorter tokens, yielding a lower average number of Characters Per Token (CPT) than natural language (see the sketch after this list).
  • Demonstration of the effectiveness of the method across a large dataset of over 100,000 prompts and multiple encoding schemes, achieving nearly perfect detection accuracy.
  • The method is computationally efficient, adding negligible cost and enabling real-time filtering of harmful prompts.
  • Validation of the technique across diverse languages and an exploration of its robustness in identifying mixed inputs with both obfuscated and benign text.
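
To make the core idea concrete, below is a minimal sketch of a CPT-based filter. It assumes the open-source `tiktoken` BPE tokenizer and an illustrative cut-off of 3.0 characters per token; the tokenizers and threshold values evaluated in the paper may differ.

```python
# Minimal sketch of CPT-Filtering (illustrative, not the authors' reference code).
import base64

import tiktoken

# Any BPE tokenizer trained on natural language should exhibit the effect;
# cl100k_base is used here purely as an example.
ENCODING = tiktoken.get_encoding("cl100k_base")
CPT_THRESHOLD = 3.0  # hypothetical cut-off: natural text tends to score higher


def characters_per_token(text: str) -> float:
    """Average number of characters per BPE token in the text."""
    tokens = ENCODING.encode(text)
    return len(text) / max(len(tokens), 1)


def looks_obfuscated(prompt: str) -> bool:
    """Flag prompts whose CPT falls below the threshold (likely encoded/ciphered)."""
    return characters_per_token(prompt) < CPT_THRESHOLD


if __name__ == "__main__":
    natural = "Please summarize the main findings of this report in three bullet points."
    encoded = base64.b64encode(natural.encode()).decode()  # simple obfuscation example

    for label, prompt in [("natural", natural), ("base64", encoded)]:
        print(f"{label}: CPT={characters_per_token(prompt):.2f} "
              f"flagged={looks_obfuscated(prompt)}")
```

In deployment, the same check could run as a lightweight pre-filter in front of a guardrail or model, or as an offline pass over a training corpus; per the abstract, the only cost is a single tokenization of the input text.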

💡 Why This Paper Matters

The paper presents a significant advance in defenses against jailbreak attacks on large language models, offering a practical and efficient method. By leveraging the inherent properties of tokenizers, the authors propose a straightforward, easily implementable solution that can be integrated into existing systems with minimal overhead, potentially lowering the risk associated with language model deployments.

🎯 Why It's Interesting for AI Security Researchers

This paper is particularly relevant to AI security researchers due to its innovative approach to mitigating a critical vulnerability in LLMs. The method addresses a pressing need for robust filtering mechanisms capable of recognizing obfuscation techniques used in jailbreak attacks, which pose significant threats to the safe deployment of AI models. The findings and methodology may inspire further research and development of similar techniques, enhancing the overall security framework for artificial intelligence applications.

📚 Read the Full Paper