Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Authors: Vanshaj Khattar, Md Rafi ur Rashid, Moumita Choudhury, Jing Liu, Toshiaki Koike-Akino, Ming Jin, Ye Wang

Published: 2026-03-16

arXiv ID: 2603.15417v1

Added to Library: 2026-03-17 04:00 UTC

Red Teaming

📄 Abstract

Test-time training (TTT) has recently emerged as a promising way to improve the reasoning abilities of large language models (LLMs): the model learns directly from test data, without access to labels. This reliance on test data, however, also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate the safety vulnerabilities of TTT through a representative self-consistency-based method, test-time reinforcement learning (TTRL), which improves LLM reasoning by rewarding self-consistency, using the majority vote over sampled answers as the reward signal. We show that harmful prompt injection during TTRL amplifies the model's existing behaviors: safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, reasoning ability declines, a cost we refer to as the reasoning tax. We also show that TTT methods such as TTRL can be exploited adversarially using specially designed "HarmInject" prompts that force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results show that TTT methods that enhance LLM reasoning by promoting self-consistency can induce amplification behaviors and reasoning degradation, underscoring the need for safer TTT methods.
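
TTRL's core update is easy to sketch: sample several candidate answers for each unlabeled test query, treat the majority answer as a pseudo-label, and reward each sample for agreeing with it. The minimal sketch below illustrates that reward computation; the sample strings, reward values, and the `majority_vote_reward` helper name are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Reward each sampled answer 1.0 if it matches the majority answer, else 0.0.

    This mirrors the self-consistency reward described in the abstract: the
    majority vote over samples serves as a pseudo-label, so whatever behavior
    most samples share -- safe or harmful -- is what gets reinforced.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]

# Hypothetical usage: five sampled completions for one unlabeled test query.
sampled = ["42", "42", "41", "42", "7"]
pseudo_label, rewards = majority_vote_reward(sampled)
print(pseudo_label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 0.0]
```

The amplification effect described in the abstract falls out of this structure: whichever response mode dominates the sampled answers, safe refusals or harmful compliance, is exactly what the reward reinforces.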

🔍 Key Points

  • Investigation of safety vulnerabilities associated with Test-Time Training (TTT) methods, specifically highlighting how prompt injections can lead to harmful amplification effects in model behaviors.
  • Introduction of the concept of a 'reasoning tax', i.e., the decline in reasoning ability that LLMs incur during TTT due to amplification effects, regardless of whether the base model starts out safe or harmful.
  • Demonstration that self-consistency-based TTT methods such as Test-Time Reinforcement Learning (TTRL) can be adversarially exploited with 'HarmInject' prompts that combine benign reasoning queries with harmful ones, degrading reasoning performance while increasing harmfulness.
  • Empirical analysis showing that benign prompt injections also lead to harmful amplification, underscoring the fragility of self-consistency methods under varied prompt conditions.
  • Emphasis on the inadequacy of simple filtering techniques for mitigating these safety and reasoning vulnerabilities, calling for the development of more robust TTT methods (a toy filter illustrating this failure mode is sketched after this list).
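
On the last point, the failure mode of naive filtering is easy to picture. The toy sketch below (the `BLOCKLIST` contents and the `is_flagged` / `filter_test_stream` helpers are hypothetical, not from the paper) flags prompts by surface keywords; an injected prompt that pairs a benign reasoning query with paraphrased harmful instructions shares no token with the blocklist, survives the filter, and still enters the test-time update.

```python
# A hypothetical surface-level keyword filter, sketched to show why
# simple filtering is a weak defense for TTT pipelines.
BLOCKLIST = {"bomb", "exploit", "malware"}  # illustrative tokens only

def is_flagged(prompt: str) -> bool:
    """Flag a prompt if any blocklisted token appears verbatim."""
    return bool(set(prompt.lower().split()) & BLOCKLIST)

def filter_test_stream(prompts: list[str]) -> list[str]:
    """Keep only prompts the naive filter does not flag.

    A combined benign+harmful prompt whose harmful half is paraphrased
    contains no blocklisted token, so it passes through and still drives
    the test-time update -- the inadequacy the paper highlights.
    """
    return [p for p in prompts if not is_flagged(p)]
```

A more robust defense would need to reason about intent rather than surface form, which is one reading of the authors' call for safer TTT methods.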

💡 Why This Paper Matters

This paper is significant because it reveals critical vulnerabilities in TTT approaches for language models: such methods can enhance reasoning, but they also introduce serious safety risks. The 'reasoning tax' captures the uncomfortable trade-off at the heart of the work: amplification can push a model toward safer (or more harmful) behavior while simultaneously eroding its reasoning ability, underscoring the delicate balance required when deploying LLMs in real-world scenarios.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers will find this paper highly relevant: it shows how a training method intended to improve reasoning can inadvertently cause safety failures in language models. Its exploration of adversarial prompt designs and harmful-behavior amplification serves as a concrete warning for the design and deployment of AI systems, and frames the discussion around safety protocols for machine-learning applications.
