
N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Authors: Zheyu Lin, Jirui Yang, Hengqi Guo, Yubing Bao, Yao Guan

Published: 2025-11-18

arXiv ID: 2511.14195v1

Added to Library: 2025-11-19 03:02 UTC

Safety

📄 Abstract

Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream Red Teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with the safety rankings derived from Red Teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token and runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.
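
The defining property of the evaluator is that it needs only a forward pass: hidden states are read out directly rather than decoded into text. The snippet below is a minimal sketch of that kind of output-free probing, assuming a Hugging Face causal LM. The per-layer angular trajectory of the last token's representation is an illustrative reading of APT, not the paper's exact formulation, and the model name and prompt are placeholders.

```python
# Minimal sketch of output-free hidden-state probing with a Hugging Face
# causal LM. No autoregressive generation is performed; we only run a single
# forward pass and inspect the per-layer hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM with hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "How do I pick a lock?"  # example probe prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    # Single forward pass; output_hidden_states exposes every layer's activations.
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, dim].
# Keep the final token's representation at every layer.
last_token_states = [h[0, -1] for h in out.hidden_states]

# Angle between consecutive layer representations of the final token:
# one simple way to trace how the latent state rotates through the network.
angles = []
for a, b in zip(last_token_states[:-1], last_token_states[1:]):
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0).clamp(-1.0, 1.0)
    angles.append(torch.acos(cos).item())

print(angles)  # layer-wise angular trajectory for this prompt
```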

🔍 Key Points

  • Introduction of N-GLARE, a non-generative safety evaluation framework leveraging latent representations instead of output generation for LLMs.
  • Development of the Jensen-Shannon Separability (JSS) metric, which captures a model's safety dynamics by analyzing its internal representations (see the sketch after this list).
  • Demonstration of the cost-efficiency of N-GLARE, achieving safety evaluations at less than 1% of the token and runtime cost compared to traditional red-teaming methods.
  • Empirical validation showing that JSS aligns closely with traditional red-teaming safety rankings, supporting its use as a proxy for model safety.
  • Exploration of the dynamic nature of model safety states, showing how latent representations evolve under different probing conditions and informing agile safety mechanisms.
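
As a companion to the JSS bullet above, here is a hedged sketch of a Jensen-Shannon-based separability score between per-prompt hidden-state features of benign versus adversarial inputs. The scalar feature per prompt and the histogram binning are assumptions made for illustration; the paper's JSS may be defined over richer statistics of the latent trajectories.

```python
# Illustrative Jensen-Shannon separability score between feature distributions
# extracted from benign vs. adversarial prompts (e.g., a per-prompt angular
# statistic from the sketch above). This is a sketch, not the paper's exact JSS.
import numpy as np
from scipy.spatial.distance import jensenshannon

def jss(benign_feats: np.ndarray, adv_feats: np.ndarray, bins: int = 32) -> float:
    """Jensen-Shannon divergence between two 1-D feature distributions."""
    lo = min(benign_feats.min(), adv_feats.min())
    hi = max(benign_feats.max(), adv_feats.max())
    p, _ = np.histogram(benign_feats, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(adv_feats, bins=bins, range=(lo, hi), density=True)
    # jensenshannon returns the JS *distance* (sqrt of the divergence); square it
    # to get a divergence in [0, 1] with base-2 logarithms.
    return jensenshannon(p + 1e-12, q + 1e-12, base=2) ** 2

# Toy usage: well-separated feature distributions drive the score toward 1,
# heavily overlapping ones toward 0.
rng = np.random.default_rng(0)
benign = rng.normal(0.2, 0.05, size=500)
adversarial = rng.normal(0.6, 0.05, size=500)
print(jss(benign, adversarial))
```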

💡 Why This Paper Matters

This paper presents a significant advancement in the safety evaluation of large language models by proposing N-GLARE, which shifts the paradigm from costly, generative red-teaming methodologies to a more efficient, representation-based analysis. By utilizing the internal dynamics of hidden states through the JSS metric, this framework not only reduces the cost and time required for safety evaluations but also enhances the granularity of safety diagnostics, making it an invaluable tool for real-time monitoring and improvement of LLMs.

🎯 Why It's Interesting for AI Security Researchers

The findings of this paper will be of particular interest to AI security researchers, as the work addresses a pressing need for more effective and efficient safety evaluation methodologies for large language models. The insights gained from the N-GLARE framework allow for early detection of potential safety issues, enabling researchers to refine model alignment and robustness. Additionally, the novel approach of analyzing latent representations can inform future research on AI safety, potentially leading to better alignment strategies and reduced risks associated with unsafe model outputs.

📚 Read the Full Paper