
Hidden State Poisoning Attacks against Mamba-based Language Models

Authors: Alexandre Le Mercier, Chris Develder, Thomas Demeester

Published: 2026-01-05

arXiv ID: 2601.01972v3

Added to Library: 2026-03-14 04:00 UTC

Tags: Red Teaming

📄 Abstract

State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically underexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 evaluates a model's information retrieval capabilities when subjected to HiSPAs, and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
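To make the amnesia mechanism concrete, here is a minimal, self-contained toy sketch (not the paper's attack or its released code): a diagonal gated recurrence in the spirit of Mamba's selective scan, where the per-token decay gate is input-dependent. The gate values below are contrived for illustration; a real HiSPA would have to find natural-language tokens that drive the learned gates toward zero.

```python
import numpy as np

def selective_scan(decays, inputs):
    """Toy diagonal SSM recurrence: h_t = a_t * h_{t-1} + x_t.

    `decays` and `inputs` are (T, d) arrays. In a real Mamba block the
    decay is input-dependent, which is the property a HiSPA trigger
    exploits. Returns the final hidden state.
    """
    h = np.zeros(inputs.shape[1])
    for a_t, x_t in zip(decays, inputs):
        h = a_t * h + x_t  # past information shrinks by a factor a_t
    return h

d = 8
rng = np.random.default_rng(0)
fact = rng.normal(size=(1, d))           # token carrying a "fact" to retrieve
filler = rng.normal(size=(3, d)) * 0.1   # benign follow-up tokens
tokens = np.vstack([fact, filler])

# Benign continuation: decay gates near 1, so the fact survives in h.
benign = selective_scan(np.full((4, d), 0.95), tokens)

# Poisoned continuation: one trigger token drives the gate toward 0,
# irreversibly wiping whatever the state held before it.
poison_decays = np.full((4, d), 0.95)
poison_decays[1] = 1e-3                  # contrived "amnesia" gate value
poisoned = selective_scan(poison_decays, tokens)

print("fact energy, benign  :", abs(float(benign @ fact[0])))
print("fact energy, poisoned:", abs(float(poisoned @ fact[0])))
```

The contrast illustrates why the effect is irreversible: once the decay multiplies the state by a near-zero factor, no later token can recover the erased information, because the recurrence has no way to re-read earlier inputs, unlike attention over a cached prompt.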

🔍 Key Points

  • The paper introduces Hidden State Poisoning Attacks (HiSPAs) that exploit the unique design of state space models (SSMs) like Mamba to induce partial amnesia by overwriting information in their hidden states.
  • The authors develop and validate a robustness benchmark, RoBench25, that evaluates models' information retrieval capabilities under HiSPAs, demonstrating severe vulnerabilities in SSMs compared to traditional Transformers.
  • The study shows that even advanced hybrid SSM-Transformer models, such as the Jamba family, degrade significantly when exposed to optimized HiSPA triggers, unlike pure Transformers, which remain robust.
  • The interpretability study reveals mechanistic patterns in Mamba's hidden layers during HiSPAs that could inform future mitigation systems (see the monitoring sketch after this list).
  • The paper draws connections between HiSPAs and existing Prompt Injection Attacks (PIAs), outlining how hidden state corruption can exacerbate the vulnerability to PIAs in hybrid models.
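Below is a hypothetical monitoring sketch motivated by the interpretability finding above; nothing in it comes from the paper. It assumes access to per-token hidden states from one SSM layer and flags tokens whose update collapses the state norm, one plausible signature of a hidden state wipe. The statistic and the threshold are illustrative assumptions.

```python
import numpy as np

def flag_state_wipes(hidden_states, drop_ratio=0.2):
    """Hypothetical HiSPA monitor: flag tokens whose update sharply
    shrinks the recurrent state.

    `hidden_states` is a (T, d) array of per-token hidden states from
    one SSM layer; `drop_ratio` is an assumed threshold, not a value
    from the paper. Returns indices of suspicious tokens.
    """
    norms = np.linalg.norm(hidden_states, axis=1)
    # Ratio of each state's norm to its predecessor's; a near-total
    # collapse suggests the state was overwritten rather than updated.
    ratios = norms[1:] / np.maximum(norms[:-1], 1e-9)
    return [int(t) + 1 for t in np.where(ratios < drop_ratio)[0]]

# Toy trace: steady states, then one token that wipes the state.
states = np.ones((6, 4))
states[3] *= 0.01                 # simulated post-trigger collapse
print(flag_state_wipes(states))   # -> [3]
```

A real mitigation system would need to calibrate such thresholds per layer against benign traffic, since legitimately sharp topic shifts can also move the state abruptly.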

💡 Why This Paper Matters

This paper is critically relevant as it unveils a significant vulnerability in state space models, which are emerging as efficient alternatives to Transformers in large language models (LLMs). The introduction of HiSPAs not only highlights the need for robust evaluations of SSMs before deployment but also opens avenues for developing defensive strategies against such adversarial threats. As these models are increasingly used in production environments, understanding and mitigating these risks will be crucial for maintaining reliable AI systems.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper of particular interest for its focus on adversarial attack methodologies that target emerging LLM architectures. The identification of HiSPAs offers a fresh perspective on vulnerabilities specific to SSMs, urging researchers to address these risks in future model designs and security protocols. The implications of these findings for prompt injection vulnerabilities also connect this new attack surface to the established body of research on adversarial threats to LLMs.

📚 Read the Full Paper: https://arxiv.org/abs/2601.01972v3