← Back to Library

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Authors: Yanxi Li, Ruocheng Shan

Published: 2025-11-23

arXiv ID: 2511.21752v1

Added to Library: 2025-12-01 03:02 UTC

Red Teaming

📄 Abstract

Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.

🔍 Key Points

  • Introduction of Label Disguise Defense (LDD) as a new lightweight and model-agnostic defense mechanism against prompt injection attacks in large language models (LLMs) used for sentiment classification.
  • Effective concealment of true labels via the use of semantically transformed or unrelated alias labels, allowing models to learn new label semantics through few-shot demonstrations.
  • Evaluation of LDD's effectiveness across nine state-of-the-art models, demonstrating that LDD can restore model accuracy lost due to adversarial attacks with some alias label pairs outperforming baseline accuracy without defense.
  • A linguistic analysis showing that semantically aligned alias labels yield stronger robustness against injection compared to unrelated symbols, emphasizing the importance of label semantics in model training.
  • The paper highlights the limitations of existing prompt injection defenses and positions LDD as an efficient alternative that does not require model retraining or access to internal mechanisms.

💡 Why This Paper Matters

This paper is significant as it presents a novel approach to defend against prompt injection attacks, which pose a substantial risk to the reliability and accuracy of large language models in crucial tasks like sentiment classification. By demonstrating the effectiveness of semantically disguised labels, the research offers a promising direction for enhancing the security and robustness of AI systems without necessitating extensive model modifications or retraining.

🎯 Why It's Interesting for AI Security Researchers

This research is particularly relevant for AI security researchers as it addresses a critical vulnerability in AIs that are increasingly deployed in sensitive applications. Understanding and mitigating the risks associated with prompt injections can help in developing more resilient AI systems, improving their trustworthiness in real-world scenarios. The exploration of semantics in defense strategies also opens new avenues for securing machine learning models against evolving adversarial strategies.

📚 Read the Full Paper