
Hidden State Poisoning Attacks against Mamba-based Language Models

Authors: Alexandre Le Mercier, Chris Develder, Thomas Demeester

Published: 2026-01-05

arXiv ID: 2601.01972v2

Added to Library: 2026-01-07 10:02 UTC

Red Teaming

📄 Abstract

State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet their adversarial robustness remains critically underexplored. This paper studies a phenomenon whereby specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states, which we refer to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 evaluates a model's information retrieval capabilities under HiSPAs and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, again unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
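To make the hidden-state overwriting mechanism concrete, here is a minimal NumPy sketch of a selective-SSM-style recurrence, h_t = a_t * h_{t-1} + b_t * x_t, in the spirit of Mamba's input-dependent gating. All dimensions, decay values, and the "trigger" token below are made up for illustration; this is not the paper's attack code. The point is simply that when a token drives the decay a_t toward zero, the new state retains almost nothing of the old one.

```python
import numpy as np

# Illustrative selective-SSM recurrence (single channel, state size d).
# In Mamba-style layers the discretized decay a_t and input scale b_t are
# functions of the current token, so one token can control how much of
# the previous hidden state survives.

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden-state size

def step(h, x, a_t, b_t):
    """One recurrence step: h_t = a_t * h_{t-1} + b_t * x_t."""
    return a_t * h + b_t * x

h = np.zeros(d)

# Encode some "document" tokens with a moderate decay (state is retained).
for x in rng.normal(size=(20, d)):
    h = step(h, x, a_t=0.95, b_t=0.05)

h_before = h.copy()

# A poisoning token for which the input-dependent gate collapses the decay:
# a_t near 0 means h_t barely depends on h_{t-1}, erasing prior context.
trigger = rng.normal(size=d)
h = step(h, trigger, a_t=0.01, b_t=1.0)

print("state norm before trigger:", np.linalg.norm(h_before))
print("cosine overlap with pre-trigger state:",
      np.dot(h, h_before) / (np.linalg.norm(h) * np.linalg.norm(h_before)))
```

In a real Mamba block the decay is a learned, token-conditioned function rather than a hand-set constant, which is precisely what an optimized trigger phrase would exploit.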

🔍 Key Points

  • Introduces Hidden State Poisoning Attacks (HiSPAs), which exploit the design of Mamba-based language models to irreversibly overwrite information in their hidden states.
  • Presents the RoBench25 benchmark for evaluating susceptibility to HiSPAs, revealing severe performance degradation in SSMs, especially Mamba, compared to Transformers.
  • Demonstrates that even large hybrid SSM-Transformer models like Jamba are vulnerable to optimized HiSPAs, which also amplify the impact of existing Prompt Injection Attacks (PIAs).
  • A mechanistic interpretability study identifies blocks 28-37 of the Mamba architecture as critical for triggering and detecting HiSPAs, providing a foundation for countermeasures.
  • Recommends defensive strategies that monitor and mitigate HiSPAs via norm-based detection in these critical blocks (a toy sketch follows this list).
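As a rough illustration of that last recommendation, here is a toy norm-based monitor. It flags token positions whose hidden-state norm in the monitored blocks deviates strongly from a short running baseline. The `hidden_states` layout, the window size, and the z-score threshold are all hypothetical choices for this sketch, not the paper's actual detector.

```python
import numpy as np

def hispa_norm_monitor(hidden_states, window=32, z_threshold=4.0):
    """Flag token positions whose per-block hidden-state norm deviates
    strongly from a running mean: a toy version of norm-based HiSPA
    detection in critical blocks (e.g. blocks 28-37 for the Mamba model
    studied in the paper).

    hidden_states: array of shape (n_blocks, n_tokens, d_state), assumed
    to be captured from the monitored blocks; this interface is made up.
    """
    flagged = set()
    norms = np.linalg.norm(hidden_states, axis=-1)  # (n_blocks, n_tokens)
    for block_norms in norms:
        for t in range(window, len(block_norms)):
            history = block_norms[t - window:t]
            mu, sigma = history.mean(), history.std() + 1e-8
            if abs(block_norms[t] - mu) / sigma > z_threshold:
                flagged.add(t)
    return sorted(flagged)

# Toy usage on synthetic states with an injected anomaly at token 100.
rng = np.random.default_rng(1)
hs = rng.normal(size=(10, 200, 16))
hs[:, 100, :] *= 8.0  # simulate a poisoning token's norm spike
print(hispa_norm_monitor(hs))  # should report position 100
```

A production detector would calibrate the threshold per block and per model, but the core idea matches the paper's observation that HiSPA triggers leave measurable signatures in the hidden states of specific blocks.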

💡 Why This Paper Matters

This paper matters because it uncovers a serious vulnerability in state space models, specifically Mamba, and provides systematic methods for evaluating their robustness against adversarial attacks. Given the growing adoption of such models in practical applications, understanding and mitigating these vulnerabilities is essential for the reliability and security of language models in real-world deployments.

🎯 Why It's Interesting for AI Security Researchers

The findings are highly relevant to AI security researchers because they expose a novel attack vector (HiSPAs) that directly compromises the integrity of SSM-based language models, which are becoming increasingly popular due to their efficiency. These vulnerabilities extend to any system built on such models, underscoring the need for thorough security measures and paving the way for future research on adversarial robustness in emerging language-model architectures.

📚 Read the Full Paper

https://arxiv.org/abs/2601.01972v2