
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Authors: Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey

Published: 2026-01-15

arXiv ID: 2601.10387v1

Added to Library: 2026-01-16 03:01 UTC

Tags: Red Teaming

📄 Abstract

Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
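The pipeline described in the abstract (extracting activation directions for diverse character archetypes, then taking the leading component of that persona space) can be sketched roughly as follows. This is a minimal illustration of the idea rather than the authors' code: the model, the archetype prompts, the layer index, and the use of a plain SVD are all assumptions.

```python
# Minimal sketch: estimate an "Assistant Axis" as the leading principal
# component of persona-conditioned activations.
# Model choice, prompt set, and layer index are illustrative assumptions.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; the paper studies larger post-trained chat models
LAYER = 6        # residual-stream layer to read activations from (assumed)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(prompt: str) -> np.ndarray:
    """Mean hidden state at LAYER over all tokens of the prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    h = out.hidden_states[LAYER][0]          # (seq_len, d_model)
    return h.mean(dim=0).numpy()

# Hypothetical archetype prompts; the paper uses a much broader set.
personas = [
    "You are a helpful assistant. Answer clearly and concisely.",
    "You are a wise oracle speaking in riddles and prophecy.",
    "You are a cynical pirate captain recounting your exploits.",
    "You are a meticulous tax consultant advising a client.",
]

acts = np.stack([mean_activation(p) for p in personas])
acts = acts - acts.mean(axis=0)              # center the persona activations

# Leading principal component of the persona space = candidate Assistant Axis.
_, _, vt = np.linalg.svd(acts, full_matrices=False)
assistant_axis = vt[0]                       # unit vector in activation space

def projection(act: np.ndarray) -> float:
    """Component of an activation along the axis; deviations flag drift."""
    return float(act @ assistant_axis)
```

The sign of a principal component is arbitrary, so in practice the axis would be oriented (for instance, by checking the projections of Assistant-style prompts) so that larger values correspond to the default Assistant persona; deviations along this axis are what the abstract describes as predictive of persona drift.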

🔍 Key Points

  • Identifies an "Assistant Axis" that characterizes the default persona of language models and enables structured manipulation of their behavior.
  • Shows that steering along the Assistant Axis reinforces helpful and harmless behavior, while deviations from it accompany persona drift, a phenomenon observed in emotionally charged interactions.
  • Introduces activation capping, which stabilizes model behavior by constraining activations to a fixed region along the Assistant Axis, mitigating harmful outputs without degrading overall performance (a minimal sketch follows this list).
  • Finds that sensitive exchanges, such as meta-reflective prompts or conversations with emotionally vulnerable users, significantly increase the likelihood of persona drift into undesirable behaviors.
  • Demonstrates that persona dynamics can be controlled through steering, providing a foundation for improving language model safety and anchoring models more closely to their intended identities.
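One way to read the activation-capping idea from the third point above is as clamping the component of each hidden state along the Assistant Axis to a fixed interval at inference time. The forward hook below is a hedged sketch of that mechanism; the hook placement, the tuple handling, and the bounds are assumptions rather than the paper's exact procedure, and `assistant_axis` is the vector from the previous sketch.

```python
# Sketch of activation capping: clamp the component of the hidden state
# along the Assistant Axis to a fixed range during generation.
# The bounds below are illustrative, not values from the paper.
import torch

LOW, HIGH = 2.0, 6.0   # assumed admissible range of projections onto the axis

def make_capping_hook(axis: torch.Tensor, low: float, high: float):
    axis = axis / axis.norm()

    def hook(module, inputs, output):
        # Transformer blocks typically return a tuple whose first element
        # is the hidden states of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        proj = hidden @ axis                          # (batch, seq_len)
        capped = proj.clamp(low, high)                # restrict to fixed region
        hidden = hidden + (capped - proj).unsqueeze(-1) * axis
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# Example usage (assumed GPT-2-style module layout):
# handle = model.transformer.h[LAYER].register_forward_hook(
#     make_capping_hook(torch.from_numpy(assistant_axis).float(), LOW, HIGH))
# ... run generation with the cap in place ...
# handle.remove()
```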

💡 Why This Paper Matters

This paper provides critical insight into the structure and dynamics of language model personas, establishing a framework for understanding and controlling their behavior. By identifying the Assistant Axis and introducing activation capping as a stabilization method, the authors offer practical tools for improving the safety of deployed models. The findings underline the importance of maintaining a coherent, beneficial persona, which is especially significant as language models are increasingly used in sensitive contexts.

🎯 Why It's Interesting for AI Security Researchers

Given rising concerns about AI safety and the ethical implications of language model outputs, this paper is directly relevant to AI security researchers. It characterizes the persona dynamics that can produce harmful outputs and provides concrete methods for detecting and limiting persona drift, including under adversarial persona-based jailbreaks. These insights can guide the development of models that remain anchored to a safe, coherent identity.

📚 Read the Full Paper