
CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama

Published: 2025-08-18

arXiv ID: 2508.12535v1

Added to Library: 2025-08-19 04:02 UTC

Red Teaming Safety

📄 Abstract

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
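
The selection step described in the abstract can be illustrated with a minimal sketch: correlate each SAE feature's per-sample activation with a binary correctness label, keep the most correlated features, and derive coefficients from average activations. The function name, the use of Pearson correlation, and taking averages over correct samples are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def select_steering_features(activations, correctness, k=8):
    """Illustrative correlation-based feature selection.

    activations: (n_samples, n_features) mean SAE activations per generated sample.
    correctness: (n_samples,) binary labels, 1 if the sample was answered correctly.
    Returns indices of the k features most positively correlated with correctness
    and one steering coefficient per selected feature.
    """
    acts = np.asarray(activations, dtype=float)
    y = np.asarray(correctness, dtype=float)

    # Pearson correlation between each feature's activation and correctness.
    acts_c = acts - acts.mean(axis=0)
    y_c = y - y.mean()
    denom = acts_c.std(axis=0) * y_c.std() * len(y)
    corr = np.where(denom > 0, (acts_c * y_c[:, None]).sum(axis=0) / denom, 0.0)

    # Keep the features with the strongest positive correlation.
    top = np.argsort(corr)[::-1][:k]

    # Steering coefficient: average activation of each selected feature on
    # correct samples (one plausible reading of "coefficients from average
    # activations" in the abstract).
    coeffs = acts[y == 1][:, top].mean(axis=0)
    return top, coeffs
```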

🔍 Key Points

  • Introduces CorrSteer, a method that improves task performance in large language models (LLMs) by selecting sparse autoencoder (SAE) features through correlation with sample correctness at inference time.
  • Eliminates the need for contrastive datasets and large activation storage, streamlining feature extraction and making the pipeline more efficient.
  • Demonstrates significant gains across multiple benchmarks, including a +4.1% improvement on MMLU and a +22.9% improvement on HarmBench, using only 4000 samples.
  • Identifies semantically meaningful feature patterns that contribute to improved bias mitigation, reasoning, and overall model safety.
  • Finds that the selected features maintain low side-effect ratios compared to conventional fine-tuning, a notable advance for steering methods in AI applications (a sketch of how such steering could be applied appears after this list).
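
Once features and coefficients are selected, steering is typically applied by adding a vector built from the SAE decoder directions to the model's hidden states during generation. The sketch below shows one common way to do this with a forward hook; the hooked layer, the `W_dec` attribute name, the scaling, and applying the vector at every token position are assumptions for illustration rather than the paper's exact procedure.

```python
import torch

def make_steering_hook(sae_decoder_weights, feature_ids, coeffs, scale=1.0):
    """sae_decoder_weights: (n_features, d_model) SAE decoder matrix.
    feature_ids, coeffs: selected feature indices and their coefficients."""
    directions = sae_decoder_weights[feature_ids]  # (k, d_model)
    steering = scale * torch.as_tensor(
        coeffs, dtype=directions.dtype, device=directions.device
    ) @ directions  # (d_model,)

    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering  # broadcast over batch and sequence dims
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Illustrative usage (layer index and attribute paths are hypothetical):
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(sae.W_dec, top, coeffs))
```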

💡 Why This Paper Matters

The paper presents CorrSteer as an efficient, automated approach to feature selection for steering large language models. By relying on correlation-based selection rather than contrastive datasets or stored activations, it improves performance across diverse benchmarks while enhancing the safety and interpretability of model behavior. As demand for robust, safe, and interpretable AI systems grows, these contributions are timely and offer a scalable alternative to existing steering pipelines.

🎯 Why It's Interesting for AI Security Researchers

AI security researchers would find this paper particularly relevant because it addresses critical challenges in mitigating biases, enhancing robustness against adversarial attacks (such as jailbreaking), and improving the interpretability of AI systems. By showcasing a method that leads to safer AI deployment, CorrSteer opens avenues for developing more resilient systems that preserve user trust while reducing potential risks associated with AI outputs.

📚 Read the Full Paper