Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

📄 Abstract

As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting'', a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to...'') at the start of a model's output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.

🔍 Key Points

Introduction of 'sockpuppeting' as a simple and effective jailbreaking method for open-weight LLMs, requiring only a single line of code and no optimization.
Sockpuppeting achieves up to 80% higher attack success rate (ASR) than existing methods like GCG when targeted at specific models, making it accessible to less sophisticated attackers.
Exploration of a hybrid approach combining sockpuppetting with gradient optimization, resulting in a 64% increase in ASR in prompt-agnostic settings on certain models.
The research highlights the inadequacies of current defenses against output-prefix injections in LLMs, particularly for open-weight models, emphasizing a need for stronger mitigation strategies.
Findings suggest that model responses can be manipulated significantly through carefully designed acceptance sequences, revealing vulnerabilities in LLMs' autoregressive behavior.

💡 Why This Paper Matters

The paper presents significant advancements in the understanding of vulnerabilities within open-weight large language models through the introduction of sockpuppeting. Its simple implementation and substantial effectiveness pose serious implications for LLM safety, particularly in light of the increasing capability of these models. The novel approach and findings underscore the urgent need for enhanced defensive mechanisms against such easy-to-execute attacks, making it crucial for both developers and researchers in the field.

🎯 Why It's Interesting for AI Security Researchers

This paper is vital for AI security researchers as it uncovers new attack vectors against large language models, particularly focusing on the threats posed by open-weight models. The revealed effectiveness of sockpuppeting, in addition to its low barrier to entry for potential attackers, raises critical concerns about the robustness of LLMs. Understanding these vulnerabilities is essential for developing secure AI models and creating frameworks that can better defend against unauthorized manipulations.

Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper