Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition

📄 Abstract

Unified speech recognition aims to perform auditory, visual, and audiovisual speech recognition within a single model framework. While speech foundation models (SFMs) have demonstrated remarkable performance in auditory tasks, their adaptation to multimodal scenarios remains underexplored. This paper presents UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and AVSR tasks by leveraging large language models (LLMs) as text decoders. Our approach introduces visual representations into multiple SFM layers through visual injection modules, enabling multimodal input processing and unified hidden representations. The augmented SFMs connect with decoder-only LLMs via a feed-forward adaptor, where concatenated representations and instruction prompts guide speech transcription. We implement a twostage training strategy: visual injection pretraining followed by speech recognition finetuning. SFM parameters remain frozen throughout training, with only visual injection modules optimized initially, and LLMs finetuned using LoRA parameters subsequently. Experimental results demonstrate superior performance over state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and noisy conditions. Ablation studies confirm generalization across various SFMs and LLMs, validating the proposed training strategy.

🔍 Key Points

Agent Skills are a newly introduced framework that allows agents to dynamically utilize knowledge based on markdown files, which presents a risk for prompt injections.
The authors demonstrate how malicious instructions can be hidden within Agent Skills to exfiltrate sensitive data, indicating a significant security vulnerability in such frameworks.
A key finding is the ability to bypass system-level guardrails by exploiting benign actions, which can be dangerous if users select options that allow actions without further prompts.
Experiments revealed that malicious scripts can be executed without user confirmation if the 'Don't ask again' feature is enabled, showcasing an exploitation pathway for attackers.
The paper emphasizes the importance of more robust defenses and alerts users against third-party Agent Skills that are not vetted for security.

💡 Why This Paper Matters

This paper is relevant as it exposes significant security vulnerabilities in the Agent Skills framework for LLMs, a critical aspect of ongoing developments in AI. By highlighting the ease with which malicious actions can be implemented and the potential consequences of such vulnerabilities, the paper serves as a call for improved security measures and oversight in AI applications that utilize similar architectures.

🎯 Why It's Interesting for AI Security Researchers

The paper would be of interest to AI security researchers as it uncovers a novel attack vector related to prompt injections, particularly in the context of continually learning models. The findings prompt further investigation into the security implications of dynamic knowledge integration in LLMs and underline the necessity for improved safeguarding mechanisms against even simple injections, which can have far-reaching impacts in practice.

Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition

📄 Abstract

🔍 Key Points

💡 Why This Paper Matters

🎯 Why It's Interesting for AI Security Researchers

📚 Read the Full Paper