Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models

Authors: Siqi Wen, Shu Yang, Shaopeng Fu, Jingfeng Zhang, Lijie Hu, Di Wang

Published: 2026-02-02

arXiv ID: 2602.01834v1

Added to Library: 2026-02-03 08:00 UTC

πŸ“„ Abstract

Vision Language Action (VLA) models close the perception-action loop by translating multimodal instructions into executable behaviors, but this very capability magnifies safety risks: jailbreaks that merely yield toxic text in LLMs can trigger unsafe physical actions in embodied systems. Existing defenses (alignment, filtering, or prompt hardening) intervene too late or at the wrong modality, leaving fused representations exploitable. We introduce a concept-based dictionary learning framework for inference-time safety control. By constructing sparse, interpretable dictionaries from hidden activations, our method identifies harmful concept directions and applies threshold-based interventions to suppress or block unsafe activations. Experiments on Libero-Harm, BadRobot, RoboPair, and IS-Bench show that our approach achieves state-of-the-art defense performance, cutting attack success rates by over 70% while maintaining task success. Crucially, the framework is plug-in and model-agnostic, requiring no retraining and integrating seamlessly with diverse VLAs. To our knowledge, this is the first inference-time concept-based safety method for embodied systems, advancing both interpretability and safe deployment of VLA models.
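The abstract describes a threshold-based intervention on hidden activations: project an activation onto learned concept directions, and suppress any harmful concept whose score exceeds a threshold. The sketch below illustrates that idea in simplified form; the function name, the assumption of a precomputed dictionary `D` of unit-norm concept rows, and the subtraction-based suppression are illustrative choices, not the paper's exact formulation (the paper learns the dictionary via sparse coding on real VLA activations).

```python
import numpy as np

def edit_activation(h, D, harmful_idx, tau):
    """Suppress harmful concept directions in one hidden activation.

    h           : (d,) hidden activation vector
    D           : (k, d) dictionary whose rows are unit-norm concept directions
    harmful_idx : indices of rows in D flagged as harmful concepts
    tau         : activation threshold above which a concept is suppressed
    """
    scores = D @ h                       # concept activation scores
    h_edit = h.astype(float).copy()
    for i in harmful_idx:
        if scores[i] > tau:              # threshold-based intervention
            h_edit -= scores[i] * D[i]   # project out the concept direction
    return h_edit

# Toy example with three orthonormal concept directions in R^3.
D = np.eye(3)
h = np.array([2.0, 0.5, 1.0])
print(edit_activation(h, D, harmful_idx=[0], tau=1.0))  # concept 0 removed
print(edit_activation(h, D, harmful_idx=[1], tau=1.0))  # below tau, untouched
```

Because the edit only subtracts along flagged directions, activation mass in benign concepts is left intact, which is one plausible reason an intervention like this can cut attack success rates while preserving task success.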

πŸ” Key Points

  • Introduces a concept-based dictionary learning framework for inference-time safety control in Vision Language Action (VLA) models, targeting jailbreaks that would translate into unsafe physical actions rather than merely toxic text.
  • Constructs sparse, interpretable dictionaries from hidden activations to identify harmful concept directions in the model's fused multimodal representations.
  • Applies threshold-based interventions at inference time to suppress or block unsafe activations before they are decoded into actions.
  • Achieves state-of-the-art defense performance on Libero-Harm, BadRobot, RoboPair, and IS-Bench, cutting attack success rates by over 70% while maintaining task success.
  • Is plug-in and model-agnostic: it requires no retraining and integrates with diverse VLAs; to the authors' knowledge, it is the first inference-time concept-based safety method for embodied systems.

πŸ’‘ Why This Paper Matters

This paper is significant because it addresses a gap that existing LLM-style defenses leave open in embodied AI: alignment, filtering, and prompt hardening intervene too late or at the wrong modality, while VLA models can turn a successful jailbreak into unsafe physical behavior. By learning sparse, interpretable concept dictionaries from hidden activations and intervening at inference time, the authors provide a defense that is both effective and deployable, since it is plug-in, model-agnostic, and requires no retraining. The approach also advances interpretability, making it clearer which internal concept directions drive unsafe behavior.

🎯 Why It's Interesting for AI Security Researchers

This paper would be of keen interest to AI security researchers because it extends the jailbreak threat model from text generation to physical action, where the cost of a successful attack is far higher. The proposed concept-based intervention operates on fused multimodal representations, precisely the layer that the authors argue existing defenses leave exploitable, and its evaluation across four embodied-safety benchmarks (Libero-Harm, BadRobot, RoboPair, and IS-Bench) offers a concrete baseline for future attack and defense work. As a retraining-free, model-agnostic mechanism, it also suggests a practical template for retrofitting safety controls onto deployed VLA systems.
