HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Authors: Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov

Published: 2025-08-22

arXiv ID: 2508.16484v1

Added to Library: 2025-08-25 04:00 UTC

Red Teaming

📄 Abstract

Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on an English benchmark (In-The-Wild Jailbreak Prompts on LLMs) and on a newly curated Arabic benchmark derived from it and annotated by native Arabic linguists, enabling multilingual assessment.
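
The abstract describes a population-based evolutionary search with temperature-controlled variability. The sketch below is a minimal, generic illustration of that search pattern only, not the paper's implementation: `score_candidate` and `mutate` are deliberately inert placeholders (the paper's actual objective and LLM-based rewriting operators are not reproduced here), and all names, parameters, and the annealing schedule are assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float = 0.0

def score_candidate(text: str) -> float:
    """Placeholder objective; in the paper this role is played by an
    evaluation of the target model's response. Stubbed out here."""
    return random.random()

def mutate(text: str, temperature: float) -> str:
    """Placeholder variation operator: temperature scales how many tokens
    would be perturbed. The actual rewriting step is intentionally a no-op."""
    words = text.split()
    n_edits = max(1, int(temperature * len(words) * 0.2))
    for _ in range(n_edits):
        i = random.randrange(len(words))
        words[i] = words[i]  # inert stand-in for a paraphrase/substitution step
    return " ".join(words)

def evolve(seeds, generations=10, population_size=20, elite_k=5,
           t_start=1.0, t_end=0.2):
    population = [Candidate(s) for s in seeds]
    for g in range(generations):
        # Anneal temperature: broad exploration early, coherence-preserving later.
        temperature = t_start + (t_end - t_start) * g / max(1, generations - 1)
        for c in population:
            c.score = score_candidate(c.text)
        elites = sorted(population, key=lambda c: c.score, reverse=True)[:elite_k]
        # Refill the population by perturbing elites at the current temperature.
        population = elites + [
            Candidate(mutate(random.choice(elites).text, temperature))
            for _ in range(population_size - elite_k)
        ]
    return max(population, key=lambda c: c.score)
```

The annealed temperature is one plausible way to read "temperature-controlled variability to balance exploration and coherence preservation": early generations tolerate larger perturbations, while later generations make only small, fluency-preserving refinements around the best candidates.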

🔍 Key Points

  • Introduction of HAMSA, an automated red-teaming framework that evolves semantically coherent, stealthy jailbreak prompts for aligned compact LLMs, improving on existing jailbreak-generation techniques.
  • Implementation of a multi-stage evolutionary search that balances semantic coherence with adversarial effectiveness during prompt generation.
  • Evaluation of the framework on an English benchmark and a newly curated Arabic dataset, showing improved attack success rates and output quality across multiple safety-critical topics.
  • Development of the Policy Puppetry Template, which disguises harmful instructions as benign configuration files to enable stealthy prompt injection.
  • Demonstration of increased vulnerability in less-resourced languages, highlighting the need for robust alignment mechanisms across diverse linguistic contexts.

💡 Why This Paper Matters

The study presents a systematic approach to generating stealthy jailbreak prompts for large language models, offering critical insight into their vulnerabilities. Beyond supplying a tool for adversarial testing, the framework highlights the often-overlooked risks posed by multilingual LLMs and motivates deeper examination and refinement of alignment efforts across languages.

🎯 Why It's Interesting for AI Security Researchers

This paper is highly relevant to AI security researchers: it exposes critical weaknesses in LLM alignment by demonstrating prompt-engineering strategies that evade safety measures. The findings underscore the need for more resilient models and for understanding how adversarial attacks behave across linguistically diverse settings, guiding future research in AI safety.

📚 Read the Full Paper