Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-Wise Pooled Representations

Authors: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das

Published: 2025-06-16

arXiv ID: 2506.13901v1

Added to Library: 2025-06-18 03:00 UTC

Red Teaming

📄 Abstract

Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots: aligned models remain vulnerable to jailbreaking, the stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
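To make the latent-geometry idea concrete, the sketch below (not the paper's exact AQI formulation; the function name, synthetic data, and the choice of only two indices are illustrative assumptions) scores how cleanly safe and unsafe activations separate in latent space using two of the cited cluster indices as implemented in scikit-learn.

```python
# Minimal sketch: given pooled activations for prompts labeled safe vs. unsafe,
# quantify how well the two classes separate in latent space with standard
# cluster-quality indices. Synthetic data stands in for real model activations.
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score


def latent_separation_scores(safe_acts: np.ndarray, unsafe_acts: np.ndarray) -> dict:
    """Score separation of safe vs. unsafe activation clusters (each of shape (n, d))."""
    X = np.vstack([safe_acts, unsafe_acts])
    labels = np.concatenate(
        [np.zeros(len(safe_acts), dtype=int), np.ones(len(unsafe_acts), dtype=int)]
    )
    return {
        # Lower Davies-Bouldin = tighter, better-separated clusters.
        "davies_bouldin": davies_bouldin_score(X, labels),
        # Higher Calinski-Harabasz = more between-cluster vs. within-cluster dispersion.
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for pooled activations of safe vs. unsafe prompts.
    safe = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
    unsafe = rng.normal(loc=3.0, scale=1.0, size=(200, 64))
    print(latent_separation_scores(safe, unsafe))
```

In this toy setup, well-separated activations yield a low Davies-Bouldin score and a high Calinski-Harabasz score; an AQI-style metric aggregates such indices into a single alignment signal.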

🔍 Key Points

  • Introduction of the Alignment Quality Index (AQI), a geometric metric for assessing the alignment of large language models (LLMs), with a demonstration of its robustness relative to existing behavioral metrics.
  • AQI combines clustering indices such as the Davies-Bouldin Score, Dunn Index, Xie–Beni Index, and Calinski–Harabasz Index to measure how well safe and unsafe activations separate in latent space, offering an intrinsic diagnostic of alignment (a layer-wise pooling sketch follows this list).
  • Creation of the LITMUS dataset, specifically designed to evaluate alignment retention and vulnerability during parameter updates, allowing a nuanced understanding of LLM performance under various conditions.
  • Empirical validation demonstrates AQI's effectiveness in detecting alignment issues (jailbreaking, alignment faking) that surface-level metrics such as refusal rates fail to identify.
  • AQI supports a geometry-first approach in alignment auditing and highlights the significance of latent separation for ensuring LLM compliance in high-stakes applications.
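
As a companion to the key points above, here is a hedged sketch of how layer-wise pooled representations might be extracted with Hugging Face transformers and fed to the separation scores shown earlier; the model choice ("gpt2"), masked mean pooling, and helper name are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: mean-pool each transformer layer's hidden states per prompt, giving one
# vector per (layer, prompt) that can feed cluster-separation indices like those above.
import torch
from transformers import AutoModel, AutoTokenizer


def pooled_layer_reps(prompts, model_name="gpt2", device="cpu"):
    """Return masked mean-pooled hidden states of shape (num_layers + 1, batch, dim)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model = model.to(device).eval()

    batch = tok(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        out = model(**batch)

    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, tokens, 1)
    pooled = []
    for h in out.hidden_states:  # one tensor per layer: (batch, tokens, dim)
        pooled.append((h * mask).sum(dim=1) / mask.sum(dim=1))  # masked mean over tokens
    return torch.stack(pooled).cpu()
```

Usage idea: compute pooled representations for safe and unsafe prompt sets, pick a layer, and pass the two activation matrices to latent_separation_scores() from the earlier sketch.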

💡 Why This Paper Matters

This paper presents a significant advance in evaluating LLMs' alignment with human values, providing a robust, geometry-based metric (AQI). By shifting the focus from output behavior to the underlying representations in latent space, it offers a more reliable way to detect hidden alignment failures. The accompanying LITMUS dataset supports more thorough assessment and safer deployment of LLMs in critical sectors like healthcare and policy-making. As the need for reliable AI systems grows, AQI's contributions are timely and essential for fostering safe, human-aligned AI systems.

🎯 Why It's Interesting for AI Security Researchers

This research is particularly relevant to AI security researchers because it addresses the urgent need for stronger metrics of model safety and compliance. AQI's emphasis on latent-space representations surfaces potential vulnerabilities that traditional evaluation methods often overlook. As LLMs are deployed in sensitive applications, understanding model behavior beyond superficial outputs becomes essential to mitigating risks of misalignment, jailbreaking, and alignment faking. Researchers can leverage AQI and the LITMUS dataset to build and audit more secure and trustworthy AI systems.

📚 Read the Full Paper