
TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Authors: Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla

Published: 2026-02-06

arXiv ID: 2602.06911v1

Added to Library: 2026-02-09 03:02 UTC

Red Teaming

📄 Abstract

As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluating tamper resistance. Varied datasets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack-model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite while ensuring end-to-end reproducibility. We use TamperBench to evaluate 21 open-weight LLMs, including defense-augmented variants, across nine tampering threats using standardized safety and capability metrics with hyperparameter sweeps per model-attack pair. This yields novel insights, including how post-training affects tamper resistance, that jailbreak-tuning is typically the most severe attack, and that Triplet emerges as a leading alignment-stage defense. Code is available at: https://github.com/criticalml-uw/TamperBench
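To make the sweep setup concrete, the following is a minimal, hypothetical sketch (not TamperBench's actual API; the model names, attack names, and grid values are illustrative placeholders) of how a "hyperparameter sweep per attack-model pair" can be enumerated as a grid of evaluation configurations:

```python
from itertools import product

# Illustrative placeholders only -- these models, attacks, and grids are
# assumptions, not values taken from the TamperBench paper or codebase.
models = ["model-a-8b", "model-b-7b"]
attacks = {
    "weight_space_finetune": {"lr": [1e-5, 5e-5], "epochs": [1, 3]},
    "latent_space_repr":     {"lr": [1e-4], "steps": [100, 500]},
}

def sweep_configs(models, attacks):
    """Yield one config dict per (model, attack, hyperparameter) combination."""
    for model, (attack, grid) in product(models, attacks.items()):
        keys, values = zip(*grid.items())
        for combo in product(*values):
            yield {"model": model, "attack": attack, **dict(zip(keys, combo))}

configs = list(sweep_configs(models, attacks))
# 2 models x (2*2 + 1*2) hyperparameter settings = 12 configurations
print(len(configs))
```

Each resulting config dict would then drive one tampering run followed by the standardized safety and capability evaluations, which is what makes results comparable across models and defenses.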

🔍 Key Points

  • TamperBench introduces the first systematic framework for evaluating the tamper resistance of open-weight large language models (LLMs), addressing critical gaps in the evaluation of LLM safety and tampering defenses.
  • The framework provides an extensive attack suite covering both weight-space fine-tuning and latent-space representation attacks, unifying diverse tampering approaches under a single execution pipeline.

💡 Why This Paper Matters

This paper presents TamperBench as a pivotal contribution to the field of AI safety, providing researchers and developers with the tools needed to assess and enhance the tamper resistance of LLMs. Its systematic approach to benchmarking is expected to lead to more robust models that can better withstand tampering threats, thus improving user safety and trust in AI systems.

🎯 Why It's Interesting for AI Security Researchers

The findings of this research are particularly valuable to AI security researchers as they directly address vulnerabilities inherent in LLMs, especially with the increasing use of open-weight models. By establishing standardized benchmarks for evaluating tampering resilience, this paper equips researchers with a foundation for developing more secure AI systems.
