
Colluding LoRA: A Composite Attack on LLM Safety Alignment

Authors: Sihao Ding

Published: 2026-03-13

arXiv ID: 2603.12681v1

Added to Library: 2026-03-16 02:01 UTC

Safety

📄 Abstract

We introduce Colluding LoRA (CoLoRA), an attack in which each adapter appears benign and plausibly functional in isolation, yet their linear composition consistently compromises safety. Unlike attacks that depend on specific input triggers or prompt patterns, CoLoRA performs composition-triggered, broad refusal suppression: once a particular set of adapters is loaded, the model's alignment effectively degrades and it complies with harmful requests without adversarial prompts or suffixes. The attack exploits the combinatorial blindness of current defense systems, for which exhaustively scanning all adapter compositions is computationally intractable. Across several open-weight LLMs, CoLoRA's adapters behave benignly in isolation yet achieve a high attack success rate once composed, indicating that securing modular LLM supply chains requires moving beyond single-module verification toward composition-aware defenses.
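At the core of the attack is ordinary linear LoRA merging. Below is a minimal sketch, assuming standard LoRA updates of the form W' = W + (alpha/r)·BA; the function and tensor names are illustrative and not taken from the paper. It only shows how several individually small low-rank updates sum into one merged weight, which is the composition step CoLoRA exploits.

```python
import torch

def compose_lora(W_base: torch.Tensor, adapters: list[dict]) -> torch.Tensor:
    """Merge LoRA adapters into a base weight: W' = W + sum_i (alpha_i / r_i) * B_i @ A_i."""
    W = W_base.clone()
    for ad in adapters:
        A, B = ad["A"], ad["B"]           # A: (r, d_in), B: (d_out, r)
        scale = ad["alpha"] / A.shape[0]  # standard LoRA scaling alpha / r
        W += scale * (B @ A)              # low-rank update added to the frozen weight
    return W

# Each B @ A update can look innocuous on its own, yet the sum of several
# such updates can move the merged weight into a very different regime.
d_out, d_in, r = 16, 16, 4
W0 = torch.randn(d_out, d_in)
adapters = [
    {"A": torch.randn(r, d_in) * 0.01, "B": torch.randn(d_out, r) * 0.01, "alpha": 8.0}
    for _ in range(3)
]
W_merged = compose_lora(W0, adapters)
```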

🔍 Key Points

  • Introduction of Colluding LoRA (CoLoRA), an attack that uses composition of adapters to compromise model safety without requiring specific input triggers.
  • Demonstration of composition-triggered broad refusal suppression, where combinations of benign adapters produce harmful outputs when merged.
  • Identification of combinatorial blindness in current defenses, which cannot feasibly assess the safety of every possible adapter composition (a rough sense of the scale is sketched after this list).
  • Successful empirical validation across several open-weight LLMs, showcasing high attack success rates upon composition while maintaining benign behavior in isolation.
  • Recommendation for the development of composition-aware defense mechanisms to secure modular LLM supply chains against such attacks.
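
To make the combinatorial-blindness point concrete, the following back-of-the-envelope calculation (with an assumed hub size, not a figure from the paper) shows how quickly the number of adapter compositions a scanner would need to vet outgrows any exhaustive check.

```python
from math import comb

n_adapters = 10_000              # hypothetical size of an adapter hub
for k in (2, 3, 4):              # number of adapters loaded together
    print(f"{k}-adapter compositions: {comb(n_adapters, k):,}")
# ~5.0e7, ~1.7e11, ~4.2e14 -- exhaustive per-composition scanning is infeasible
```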

💡 Why This Paper Matters

The research on CoLoRA highlights a critical vulnerability in the composition of modular language models, showing that current safety verification methods are insufficient against multi-adapter collusion attacks. It underscores the need for defenses that assess the safety of adapter combinations rather than individual adapters, since traditional unit-centric approaches cannot prevent these emergent risks. The findings matter for the safety and security of LLM applications and point to composition-aware defenses as a concrete path forward.

🎯 Why It's Interesting for AI Security Researchers

This paper is of interest to AI security researchers because it exposes significant weaknesses in existing LLM safety strategies and introduces a novel attack vector that leverages modular adapter architectures. The insights on multi-adapter collusion and composition-triggered attacks raise critical considerations for security frameworks and mitigation strategies, making it essential reading for anyone working on AI safety and adversarial machine learning.
