
Black-box Optimization of LLM Outputs by Asking for Directions

Authors: Jie Zhang, Meng Ding, Yang Liu, Jue Hong, Florian Tramèr

Published: 2025-10-19

arXiv ID: 2510.16794v1

Added to Library: 2025-10-21 04:03 UTC

Red Teaming

📄 Abstract

We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks require either access to continuous model outputs like logits or confidence scores (which are rarely available in practice), or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks and prompt injections. Our attacks successfully generate malicious inputs against systems that only expose textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox where model capability improvements directly enhance vulnerability. Our code is available at this [link](https://github.com/zj-jayzhang/black_box_llm_optimization).
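To make the core idea concrete, below is a minimal sketch, not the authors' implementation: prompt the target model to verbalize a confidence score for a candidate input, then use that self-reported number as the objective of a simple random search. The names `query_model`, `self_reported_confidence`, and `random_search_attack`, along with the probe wording, are illustrative assumptions; the paper's actual prompts and search procedure are in the linked repository.

```python
# Hedged sketch of confidence-guided black-box optimization (illustrative only).
# Assumption: `query_model(prompt) -> str` is the only access to the target LLM
# (text in, text out), and it answers a "rate your confidence 0-100" style
# question with a parseable number.
import random
import re


def query_model(prompt: str) -> str:
    """Black-box, text-only access to the target LLM (stub for illustration)."""
    raise NotImplementedError


def self_reported_confidence(candidate: str, goal: str) -> float:
    """Ask the model, in natural language, how likely it is that `candidate`
    elicits the attacker's `goal`, and parse the number it reports."""
    probe = (
        f"Consider the request below.\n---\n{candidate}\n---\n"
        f"On a scale of 0 to 100, how likely are you to respond with {goal}? "
        "Answer with a single number."
    )
    reply = query_model(probe)
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0


def random_search_attack(seed_prompt: str, goal: str, tokens: list[str],
                         steps: int = 200) -> str:
    """Greedy random search guided only by the model's verbalized confidence."""
    best = seed_prompt
    best_score = self_reported_confidence(best, goal)
    for _ in range(steps):
        candidate = best + " " + random.choice(tokens)   # random suffix mutation
        score = self_reported_confidence(candidate, goal)
        if score > best_score:                           # keep improving mutations
            best, best_score = candidate, score
    return best
```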

πŸ” Key Points

  • Introduces a novel black-box attack on large language models (LLMs) that exploits their ability to express confidence in natural language, removing the need for access to logits or other internal signals.
  • Demonstrates effective adversarial optimization across three attack scenarios: adversarial examples for vision-LLMs, jailbreaks, and prompt injections, achieving significant success rates with few queries.
  • Finds a concerning paradox: better and larger models express confidence that is better calibrated, so improvements in model capability directly make these attacks more effective.
  • Validates the efficiency of a query-based optimization signal derived from comparative confidence assessments, which outperforms relying on absolute confidence scores (a minimal sketch of such a comparative query appears below).
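As a hedged illustration of the comparative signal mentioned in the last point (assuming the same hypothetical `query_model` text-only interface and illustrative probe wording as in the earlier sketch), the search can be driven purely by pairwise preferences rather than absolute scores:

```python
# Illustrative sketch of a comparison-based optimization signal; the paper's
# exact prompt wording and search procedure may differ.
import random


def query_model(prompt: str) -> str:
    """Black-box, text-only access to the target LLM (stub for illustration)."""
    raise NotImplementedError


def prefers_candidate_b(candidate_a: str, candidate_b: str, goal: str) -> bool:
    """Ask the model which of two inputs it is more likely to answer in the
    attacker's desired way; returns True if it names the second one."""
    probe = (
        "You will see two requests, A and B.\n"
        f"A: {candidate_a}\nB: {candidate_b}\n"
        f"For which request are you more likely to produce {goal}? "
        "Answer with exactly 'A' or 'B'."
    )
    return query_model(probe).strip().upper().startswith("B")


def comparative_search(seed_prompt: str, goal: str, tokens: list[str],
                       steps: int = 200) -> str:
    """Hill climbing that only ever needs comparisons, never absolute scores."""
    best = seed_prompt
    for _ in range(steps):
        candidate = best + " " + random.choice(tokens)  # random suffix mutation
        if prefers_candidate_b(best, candidate, goal):  # keep the preferred one
            best = candidate
    return best
```

One plausible design rationale for preferring comparisons: a pairwise oracle does not require the model to produce consistently scaled numbers across queries, only a relative judgment between two candidates.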

💡 Why This Paper Matters

This paper highlights critical vulnerabilities in large language models (LLMs) by showing that effective black-box attacks are possible against systems that expose only textual outputs. Its findings underscore the need for stronger security measures as LLMs are integrated into more applications, and the observation that more capable models are also more attackable makes understanding and addressing these vulnerabilities paramount for safe deployment.

🎯 Why It's Interesting for AI Security Researchers

The paper is of direct interest to AI security researchers because it shows that deployed LLMs remain susceptible to query-based optimization attacks even when they expose nothing but text. By demonstrating how a model's own verbalized confidence can be turned into an optimization signal, it broadens the set of realistic attack vectors against AI systems and underscores the importance of developing robust defenses against this class of attack.

📚 Read the Full Paper