CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

📄 Abstract

The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, speedup ranging from 2.3 to 128.4 times compared to the baseline models. To support future research and reproducibility, we have made our implementation publicly available.

🔍 Key Points

Introduction of CoCoTen, a novel method for detecting adversarial inputs to Large Language Models (LLMs) using latent space features of Contextual Co-occurrence Tensors, achieving notable accuracy even with minimal labeled data.
Demonstration of significant performance improvements over baseline models with a reported F1 score of 0.83 using just 0.5% labeled prompts, and speed enhancements ranging from 2.3x to 128.4x faster than traditional models.
Comprehensive analysis of dataset characteristics and methodological robustness, with acceptable generalization across different datasets despite known challenges in adversarial prompt detection.
Sensitivity testing on hyperparameters such as co-occurrence window size and tensor rank to optimize model performance, contributing to deeper insights on model tuning in adversarial contexts.

💡 Why This Paper Matters

This paper is relevant as it addresses critical vulnerabilities in LLMs by presenting a new, efficient detection method for adversarial prompts, enhancing the security and reliability of AI systems. CoCoTen's ability to function effectively with minimal labeled data and reduced computational demands marks a significant advance in safeguarding AI applications against jailbreak attacks, ultimately promoting safer AI deployment in real-world scenarios.

🎯 Why It's Interesting for AI Security Researchers

This paper is of substantial interest to AI security researchers because it tackles the pressing issue of adversarial attacks, particularly the jailbreak types that aim to manipulate LLM outputs. The innovative approaches outlined, along with empirical results indicating strong performance even in data-scarce environments, provide foundational methods that could be expanded upon or integrated into existing AI security frameworks. Furthermore, the open access to implementation facilitates reproducibility and follow-up studies in the field.

CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper