
Prompt Injection Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Authors: Diego Gosmar, Deborah A. Dahl

Published: 2026-01-19

arXiv ID: 2601.13186v1

Added to Library: 2026-01-21 04:00 UTC

📄 Abstract

Prompt injection remains a central obstacle to the safe deployment of large language models, particularly in multi-agent settings where intermediate outputs can propagate or amplify malicious instructions. Building on earlier work that introduced a four-metric Total Injection Vulnerability Score (TIVS), this paper extends the evaluation framework with semantic similarity-based caching and a fifth metric (Observability Score Ratio, OSR) to yield TIVS-O, investigating how defence effectiveness interacts with transparency in a HOPE-inspired Nested Learning architecture. The proposed system combines an agentic pipeline with Continuum Memory Systems that implement semantic similarity-based caching across 301 synthetically generated injection-focused prompts drawn from ten attack families, while a fourth agent performs comprehensive security analysis using five key performance indicators. In addition to traditional injection metrics, OSR quantifies the richness and clarity of security-relevant reasoning exposed by each agent, enabling an explicit analysis of trade-offs between strict mitigation and auditability. Experiments show that the system achieves secure responses with zero high-risk breaches, while semantic caching delivers substantial computational savings, achieving a 41.6% reduction in LLM calls and corresponding decreases in latency, energy consumption, and carbon emissions. Five TIVS-O configurations reveal optimal trade-offs between mitigation strictness and forensic transparency. These results indicate that observability-aware evaluation can reveal non-monotonic effects within multi-agent pipelines and that memory-augmented agents can jointly maximize security robustness, real-time performance, operational cost savings, and environmental sustainability without modifying underlying model weights, providing a production-ready pathway for secure and green LLM deployments.
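
The abstract attributes the 41.6% reduction in LLM calls to semantic similarity-based caching inside the Continuum Memory Systems. The paper's implementation is not reproduced here; the snippet below is only a minimal, illustrative Python sketch of the general idea, assuming an injected `embed` function (for example, a sentence-embedding model), an arbitrarily chosen cosine-similarity threshold of 0.92, and placeholder names such as `SemanticCache` and `answer` that are not the authors' code.

```python
# Minimal semantic-similarity cache sketch (illustrative, not the paper's code).
# Assumes an `embed` callable that maps text to a fixed-size vector; it is injected
# so the cache stays model-agnostic.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple
import math


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SemanticCache:
    embed: Callable[[str], List[float]]                 # embedding function (assumption)
    threshold: float = 0.92                             # similarity required for a cache hit
    entries: List[Tuple[List[float], str]] = field(default_factory=list)

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a cached response if a prior prompt is semantically close enough."""
        query = self.embed(prompt)
        best_sim, best_resp = 0.0, None
        for vec, resp in self.entries:
            sim = cosine(query, vec)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        return best_resp if best_sim >= self.threshold else None

    def store(self, prompt: str, response: str) -> None:
        """Remember a prompt/response pair for future near-duplicate queries."""
        self.entries.append((self.embed(prompt), response))


def answer(cache: SemanticCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    """Serve from the cache when possible; otherwise call the LLM and cache the result."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached                                   # avoided LLM call: less latency and energy
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response
```

Every cache hit avoids one LLM call, which is where the reported latency, energy, and carbon savings would come from in a deployment of this kind.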

🔍 Key Points

  • Extends the earlier four-metric Total Injection Vulnerability Score (TIVS) with a fifth metric, the Observability Score Ratio (OSR), yielding TIVS-O, which jointly evaluates mitigation strictness and forensic transparency.
  • Combines an agentic pipeline with Continuum Memory Systems that implement semantic similarity-based caching, evaluated on 301 synthetically generated injection-focused prompts drawn from ten attack families.
  • A dedicated fourth agent performs comprehensive security analysis against five key performance indicators; a hedged sketch of one possible aggregation appears after this list.
  • Achieves secure responses with zero high-risk breaches, while semantic caching reduces LLM calls by 41.6%, with corresponding decreases in latency, energy consumption, and carbon emissions.
  • Five TIVS-O configurations reveal non-monotonic effects within the multi-agent pipeline and optimal trade-offs between mitigation strictness and auditability, all without modifying underlying model weights.
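
The abstract does not spell out how the four TIVS metrics and OSR are combined, so the following is only a hypothetical sketch of an observability-aware composite score: the indicator names, the normalization to [0, 1], and the weighted-average formula are all assumptions made for illustration, not the paper's definition of TIVS-O.

```python
# Hypothetical TIVS-O-style aggregation sketch. The five indicator names and the
# weighted-average formula are placeholders; the paper defines its own four TIVS
# metrics plus the Observability Score Ratio (OSR).
from typing import Dict


def tivs_o(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-indicator scores (each assumed normalized to [0, 1]) into one value."""
    assert set(scores) == set(weights), "every indicator needs a weight"
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Example: four injection-vulnerability indicators plus OSR, with a heavier
# (made-up) emphasis on observability in this configuration.
example_scores = {
    "injection_detection": 0.95,
    "sanitization": 0.90,
    "policy_compliance": 0.88,
    "response_safety": 0.97,
    "osr": 0.80,  # Observability Score Ratio: richness of exposed security reasoning
}
example_weights = {name: 1.0 for name in example_scores}
example_weights["osr"] = 2.0
print(f"TIVS-O (illustrative): {tivs_o(example_scores, example_weights):.3f}")
```

Varying the weights (here, the OSR weight) is one way to realize different configurations that trade mitigation strictness against forensic transparency, in the spirit of the five configurations the paper reports.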

💡 Why This Paper Matters

The paper shows that prompt injection mitigation, observability, and sustainability need not be competing goals. A memory-augmented multi-agent pipeline achieves secure responses with zero high-risk breaches, while semantic similarity-based caching cuts LLM calls by 41.6% and brings corresponding reductions in latency, energy consumption, and carbon emissions. Because the approach requires no changes to underlying model weights, it offers a production-ready pathway for secure and green LLM deployments, and the observability-aware TIVS-O evaluation gives practitioners a principled way to balance strict mitigation against auditability.

🎯 Why It's Interesting for AI Security Researchers

Prompt injection in multi-agent settings, where intermediate outputs can propagate or amplify malicious instructions, remains an open problem for deployed LLM systems. This work contributes an observability-aware evaluation framework (TIVS-O), evidence of non-monotonic defence effects within multi-agent pipelines, and a concrete demonstration that semantic caching can deliver security, performance, cost, and sustainability benefits at the same time. Researchers studying agentic defences, forensic transparency, or the operational cost of LLM guardrails will find both the metric design and the trade-off analysis directly reusable.

📚 Read the Full Paper