Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs

Authors: Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He

Published: 2025-12-26

arXiv ID: 2512.21999v1

Added to Library: 2026-01-07 10:07 UTC

📄 Abstract

While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, and propose heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To address this, we propose ALEAHallu, an adversarial parametric editing framework for hallucination mitigation in VLMs that follows an Activate-Locate-Edit Adversarially paradigm. Specifically, we first construct an activation dataset comprising grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned on prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
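The locate step lends itself to a compact illustration: run paired grounded and hallucinatory inputs through the model, capture hidden states per module, and rank modules by how far the two activations diverge. The PyTorch sketch below is only a hedged approximation of that idea; the hooked module names (`mlp` blocks), the mean-pooling over tokens, the L2 gap score, and the assumption of equal-sized paired batches are illustrative choices, not the authors' implementation.

```python
import torch


def locate_hallucination_prone_modules(model, grounded_batches,
                                       hallucinatory_batches, top_k=10):
    """Rank modules by the mean hidden-state gap between paired grounded
    and hallucinatory responses (illustrative heuristic, not the paper's)."""
    captured = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            # Mean-pool over the token dimension (an assumption).
            captured[name] = hidden.float().mean(dim=1).detach()
        return hook

    # Assumed target: the MLP block of every transformer layer.
    handles = [module.register_forward_hook(make_hook(name))
               for name, module in model.named_modules()
               if name.endswith("mlp")]

    scores = {}
    with torch.no_grad():
        for pos_inputs, neg_inputs in zip(grounded_batches, hallucinatory_batches):
            model(**pos_inputs)
            pos_states = dict(captured)
            model(**neg_inputs)
            neg_states = dict(captured)
            for name in pos_states:
                gap = (pos_states[name] - neg_states[name]).norm(dim=-1).mean().item()
                scores[name] = scores.get(name, 0.0) + gap

    for handle in handles:
        handle.remove()

    # Modules with the largest gap are treated as hallucination-prone clusters.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```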

🔍 Key Points

  • Proposes ALEAHallu, an adversarial parametric editing framework that mitigates hallucinations in Vision-Language Models by following an Activate-Locate-Edit Adversarially paradigm.
  • Constructs an activation dataset pairing grounded responses (positive samples attentively anchored in visual features) with hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts).
  • Locates critical hallucination-prone parameter clusters by analyzing the differential hidden states of paired responses.
  • Fine-tunes the located clusters on prompts injected with adversarially tuned prefixes optimized to maximize visual neglect, forcing the model to prioritize visual evidence over its parametric biases (see the sketch after this list).
  • Demonstrates significant hallucination reduction on both generative and discriminative VLM tasks.
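
To make the edit step concrete, here is a hedged sketch of how an adversarial soft prefix could be tuned to maximize visual neglect, i.e. to raise the likelihood of the hallucinatory (prior-driven) targets; prompts carrying such a prefix would then be used while fine-tuning only the previously located clusters. The soft-prefix parameterization, the objective, and the Hugging-Face-style interface (`inputs_embeds`, `labels`, `-100` label padding) are assumptions for illustration, not the paper's actual procedure.

```python
import torch


def tune_adversarial_prefix(model, embed_tokens, neg_input_ids, neg_labels,
                            prefix_len=8, steps=100, lr=1e-2):
    """Optimize a soft prefix so the model more readily emits hallucinatory
    (prior-driven) targets, i.e. a prefix that maximizes visual neglect."""
    device = next(model.parameters()).device
    hidden_size = model.config.hidden_size
    prefix = (0.02 * torch.randn(1, prefix_len, hidden_size, device=device)
              ).requires_grad_(True)
    optimizer = torch.optim.Adam([prefix], lr=lr)

    model.requires_grad_(False)  # only the prefix is trained in this phase
    for _ in range(steps):
        token_embeds = embed_tokens(neg_input_ids)                  # (B, T, H)
        inputs_embeds = torch.cat(
            [prefix.expand(token_embeds.size(0), -1, -1), token_embeds], dim=1)
        ignore = torch.full((neg_labels.size(0), prefix_len), -100,
                            dtype=neg_labels.dtype, device=neg_labels.device)
        labels = torch.cat([ignore, neg_labels], dim=1)

        # Lower loss on the hallucinatory targets means stronger visual neglect.
        loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return prefix.detach()
```

In the pipeline described in the abstract, the located parameter clusters would then be fine-tuned on prompts injected with such prefixes, concentrating the edit where hallucinations originate.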

💡 Why This Paper Matters

Heuristic decoding calibration strategies for hallucination mitigation are non-trainable, which inherently limits their optimization potential. ALEAHallu instead edits the model itself: it locates the parameter clusters most implicated in hallucination and fine-tunes them under adversarial prefixes designed to maximize visual neglect, forcing the model to prioritize visual evidence over its inherent parametric biases. This offers a trainable, targeted alternative that improves faithfulness on both generative and discriminative VLM tasks.

🎯 Why It's Interesting for AI Security Researchers

Hallucinations undermine the trustworthiness of deployed VLMs, since outputs that favor linguistic priors over visual evidence can be confidently wrong. The paper's locate-and-edit methodology, which traces failure-prone behavior to specific parameter clusters through differential hidden states and then corrects it with adversarial parametric editing, is relevant to researchers who audit model internals and study robustness against a model's own prior biases.

📚 Read the Full Paper: https://arxiv.org/abs/2512.21999