
COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Authors: Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong

Published: 2025-12-02

arXiv ID: 2512.02318v1

Added to Library: 2025-12-03 03:01 UTC

Category: Safety

📄 Abstract

This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation pipeline. Code availability: https://anonymous.4open.science/r/Captcha-465E/.
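The abstract describes an evaluation protocol built around four measurements: single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. The paper's actual harness is in the linked repository; below is a minimal sketch of what such a harness could look like, assuming an OpenAI-compatible multimodal chat API. The model name, prompt, grader, and per-token prices are placeholders, not the authors' setup.

```python
# Minimal sketch of a CAPTCHA-solver evaluation harness (hypothetical names
# throughout; the paper's real code is at the anonymous link above).
import base64
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder per-token prices in USD; real per-solve cost is model-specific.
PRICE_IN, PRICE_OUT = 2.5e-6, 1.0e-5

def solve_once(image_path: str, prompt: str, model: str = "gpt-4o"):
    """Send one CAPTCHA image to the model; return (answer, latency_s, cost_usd)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    latency = time.perf_counter() - t0  # end-to-end latency for this attempt
    usage = resp.usage
    cost = usage.prompt_tokens * PRICE_IN + usage.completion_tokens * PRICE_OUT
    return resp.choices[0].message.content, latency, cost

def solve_with_retries(image_path, prompt, check, k=3):
    """Retry up to k times; `check(answer) -> bool` is a task-specific grader."""
    total_latency = total_cost = 0.0
    for attempt in range(1, k + 1):
        answer, latency, cost = solve_once(image_path, prompt)
        total_latency += latency
        total_cost += cost
        if check(answer):
            return True, attempt, total_latency, total_cost
    return False, k, total_latency, total_cost
```

Aggregating `solve_with_retries` over a set of challenges per task type yields exactly the four quantities the abstract names: accuracy, retry behavior, latency, and cost.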

🔍 Key Points

  • This study quantitatively evaluates how well multimodal large language models (MLLMs) solve a broad range of visual CAPTCHA types, showing that they can bypass simpler CAPTCHAs at human-like cost and latency.
  • The authors establish an evaluation framework covering single-shot accuracy, finite-retry behavior, end-to-end latency, and per-solve cost, giving a concrete picture of the threat MLLMs pose (see the retry arithmetic after this list).
  • The research identifies a pronounced hardness gap among CAPTCHA task types: recognition-oriented tasks are effectively broken, while tasks requiring fine-grained spatial reasoning remain robust against current MLLM capabilities.
  • Analysis of MLLM reasoning traces exposes common error patterns, which the authors distill into concrete defense-oriented design guidelines for CAPTCHA tasks that resist automated solving.
  • The paper stresses the urgent need for web service operators to adapt their CAPTCHA strategies to the evolving capabilities of MLLMs, underlining the real-world stakes for online security.
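On the finite-retry point: the paper's exact metric is defined in the full text, but a standard idealization (assuming attempts are roughly independent) shows why even modest single-shot accuracy is dangerous under retries:

```latex
% Success within k retries, given single-shot accuracy p (independence assumed;
% the paper's finite-retry metric may be defined differently).
P_{\text{success}}(k) = 1 - (1 - p)^k
% Example: p = 0.5, k = 3  =>  1 - 0.5^3 = 0.875
```

A solver with only 50% single-shot accuracy clears roughly 87.5% of challenges within three attempts, which is why defenders must consider retry behavior and not just single-shot accuracy.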

💡 Why This Paper Matters

This paper matters because it both documents the vulnerabilities that multimodal large language models open in current CAPTCHA systems and provides foundational insights for improving web security. By pairing a practical evaluation framework with defense-oriented design guidelines, it gives web service operators concrete strategies for confronting AI-enabled attacks and keeping CAPTCHA effective as a security layer.

🎯 Why It's Interesting for AI Security Researchers

These findings are directly relevant to AI security researchers because they dissect the interaction between advanced AI capabilities and a widely deployed security measure. The detailed analysis of model behavior on CAPTCHA tasks offers a basis for anticipating how future models may exploit security protocols, and it informs the design of more robust systems that can withstand these emerging threats, contributing to the broader field of AI safety and security.

📚 Read the Full Paper: https://arxiv.org/abs/2512.02318v1