
TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering

Authors: Scott Thornton

Published: 2026-01-06

arXiv ID: 2601.03300v1

Added to Library: 2026-01-08 03:05 UTC

Red Teaming

📄 Abstract

Large language models remain vulnerable to jailbreak attacks, and single-layer defenses often trade security for usability. We present TRYLOCK, the first defense-in-depth architecture that combines four heterogeneous mechanisms across the inference stack: weight-level safety alignment via DPO, activation-level control via Representation Engineering (RepE) steering, adaptive steering strength selected by a lightweight sidecar classifier, and input canonicalization to neutralize encoding-based bypasses. On Mistral-7B-Instruct evaluated against a 249-prompt attack set spanning five attack families, TRYLOCK achieves 88.0% relative ASR reduction (46.5% to 5.6%), with each layer contributing unique coverage: RepE blocks 36% of attacks that bypass DPO alone, while canonicalization catches 14% of encoding attacks that evade both. We discover a non-monotonic steering phenomenon -- intermediate strength (alpha=1.0) degrades safety below baseline -- and provide mechanistic hypotheses explaining RepE-DPO interference. The adaptive sidecar reduces over-refusal from 60% to 48% while maintaining identical attack defense, demonstrating that security and usability need not be mutually exclusive. We release all components -- trained adapters, steering vectors, sidecar classifier, preference pairs, and complete evaluation methodology -- enabling full reproducibility.
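The activation-level layer adds a "refusal direction" vector to the model's hidden states at inference time, scaled by a strength alpha that the sidecar classifier picks per prompt; the non-monotonic finding means the schedule should skip intermediate strengths like alpha=1.0. The sketch below shows one common way to wire such RepE-style steering with a PyTorch forward hook. The layer index, module path, and two-point alpha schedule are illustrative assumptions, not the paper's released configuration.

```python
# Hedged sketch of RepE-style activation steering with adaptive strength.
# Layer index, module path, and alpha schedule are illustrative assumptions,
# not TRYLOCK's released configuration.
import torch

def make_steering_hook(steering_vec: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * steering_vec to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vec.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def apply_adaptive_steering(model, steering_vec: torch.Tensor,
                            risk_score: float, layer_idx: int = 20):
    """Choose alpha from the sidecar's risk score and register the hook.

    Two-point schedule: steer hard on risky prompts, not at all on benign
    ones, deliberately skipping the intermediate regime (alpha near 1.0)
    that the paper reports as worse than no steering.
    """
    alpha = 2.0 if risk_score > 0.5 else 0.0
    layer = model.model.layers[layer_idx]  # assumes a Mistral/LLaMA-style stack
    return layer.register_forward_hook(make_steering_hook(steering_vec, alpha))
```

A caller would register the hook before generation and remove it afterwards via the returned handle's remove() method, so steering never leaks into unrelated requests.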

🔍 Key Points

  • Introduction of TRYLOCK: the first defense-in-depth architecture for LLMs, consisting of four layers, namely weight-level safety alignment (DPO), activation-level control (RepE steering), an adaptive sidecar classifier that selects steering strength, and input canonicalization; a hedged sketch of this pipeline follows the list.
  • Empirical results show that TRYLOCK achieves an 88.0% relative reduction in Attack Success Rate (ASR) from 46.5% to 5.6%, with each layer providing complementary protection against various attack types.
  • Discovery of a non-monotonic steering phenomenon, where intermediate steering strength worsens safety outcomes, highlighting the need for careful optimization in activation steering techniques.
  • Adaptive sidecar classifier reduces over-refusal from 60% to 48%, indicating that security and usability can be balanced effectively in LLM safety.
  • The paper emphasizes the importance of layered defenses in AI security, suggesting that combining heterogeneous mechanisms can lead to more resilient systems against jailbreak attacks.
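Putting the layers together yields a simple inference-time flow: canonicalize the input, score it with the sidecar, steer activations according to the score, and generate with the DPO-aligned weights. Below is a hedged sketch of that orchestration, reusing apply_adaptive_steering from the previous snippet; the canonicalization heuristics and the sidecar.score / model.generate_text interfaces are hypothetical stand-ins, not the paper's API.

```python
# Hedged sketch of the defense-in-depth flow the four layers imply.
# Canonicalization heuristics and the sidecar/model interfaces are
# hypothetical; only the layer ordering follows the paper's description.
import base64
import unicodedata

def canonicalize(prompt: str) -> str:
    """Input layer: normalize Unicode and decode base64-looking tokens to
    neutralize encoding-based bypasses before the model sees the prompt."""
    text = unicodedata.normalize("NFKC", prompt)
    for token in text.split():
        # Heuristic: only long, length-aligned tokens are plausibly base64.
        if len(token) < 16 or len(token) % 4 != 0:
            continue
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except Exception:
            continue
        if decoded.isprintable():
            text = text.replace(token, decoded)
    return text

def defend(prompt: str, sidecar, model, steering_vec) -> str:
    """Run the layered stack end to end for a single request."""
    clean = canonicalize(prompt)    # input canonicalization layer
    risk = sidecar.score(clean)     # lightweight sidecar classifier layer
    handle = apply_adaptive_steering(model, steering_vec, risk)  # RepE layer
    try:
        # DPO safety alignment is already baked into the model weights.
        return model.generate_text(clean)
    finally:
        handle.remove()  # never leak steering into subsequent requests
```

The try/finally around generation mirrors the usual hook hygiene: steering is scoped per request, which is what lets the sidecar pick a different alpha for each prompt.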

💡 Why This Paper Matters

TRYLOCK represents a significant advance in LLM safety: a multi-layered defense designed specifically to counter jailbreak attacks. By combining heterogeneous protective mechanisms and demonstrating substantial improvements in both attack prevention and usability, the work sets a new standard for future research in AI safety. The open release of all components further promotes reproducibility and innovation within the field.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers because it addresses a critical vulnerability of deployed large language models: jailbreak attacks. The proposed architecture not only mitigates these risks with a layered approach but also attends to usability, which is crucial for real-world deployment. Moreover, the open release of trained adapters, steering vectors, preference pairs, and the complete evaluation methodology enables further exploration and hardening of security mechanisms in AI systems.
