Qwen2-Audio-7B

Qwen

Introduction

Qwen2-Audio is a series of large audio-language models that accept various audio inputs and generate textual responses. It supports two modes of interaction: voice chat, where the user speaks without any text input, and audio analysis, where the user supplies audio together with text instructions. The series includes the pretrained Qwen2-Audio-7B model and the Qwen2-Audio-7B-Instruct chat model. For further information, refer to the Blog, GitHub, and Report.
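The two interaction modes differ mainly in whether the user turn carries a text instruction alongside the audio. A minimal sketch of the message structures, assuming the multimodal chat-message convention used by Transformers chat templates (the field names and the reuse of the glass-breaking clip URL from the guide below are illustrative, not confirmed by this card):

```python
# Sketch of the two interaction modes as chat-style user turns.
# The {"type": "audio"/"text"} content format is an assumption based on
# Transformers' multimodal chat conventions; verify against the
# Qwen2-Audio-7B-Instruct model card before relying on it.
AUDIO_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"

# Voice chat mode: audio only, no text instruction.
voice_chat_turn = {
    "role": "user",
    "content": [
        {"type": "audio", "audio_url": AUDIO_URL},
    ],
}

# Audio analysis mode: audio plus a text instruction.
analysis_turn = {
    "role": "user",
    "content": [
        {"type": "audio", "audio_url": AUDIO_URL},
        {"type": "text", "text": "What sound is this?"},
    ],
}
```

With the Instruct model, turns like these would be rendered into a prompt via the processor's chat template; the pretrained Qwen2-Audio-7B model instead takes raw prompts with audio placeholder tokens, as shown in the guide below.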

Architecture

Qwen2-Audio models are integrated into the Hugging Face Transformers library (as Qwen2AudioForConditionalGeneration), so they can be loaded and run with the standard Transformers API. The same checkpoints serve both interaction modes, supporting use cases from audio analysis to voice communication.

Training

While this card does not detail the training procedure, the Qwen2-Audio models are pretrained to map audio inputs to text. The released checkpoints can be downloaded and deployed directly for audio analysis and text generation tasks.

Guide: Running Locally

To run Qwen2-Audio locally:

  1. Install Dependencies: Ensure you have the latest Hugging Face Transformers by running pip install git+https://github.com/huggingface/transformers.
  2. Load Model and Processor:
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
    
    model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
    
  3. Prepare and Process Audio Input:
    from io import BytesIO
    from urllib.request import urlopen
    import librosa
    
    prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
    url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"
    audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=prompt, audios=audio, return_tensors="pt")
    
  4. Generate Text from Audio:
    generated_ids = model.generate(**inputs, max_length=256)
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    
  5. Consider Cloud GPUs: For better performance with a 7B-parameter model, cloud GPUs from providers such as AWS, Google Cloud, or Azure are a practical option.
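The prompt in step 3 follows a fixed pattern: each audio clip is represented by the special-token sequence <|audio_bos|><|AUDIO|><|audio_eos|>, followed by the text instruction. A small helper (the function name is ours, and repeating the placeholder for multiple clips is an assumption; only the token pattern itself comes from the example above) makes the structure explicit:

```python
def build_audio_prompt(instruction: str, n_audio: int = 1) -> str:
    """Build a Qwen2-Audio-7B prompt: one <|audio_bos|><|AUDIO|><|audio_eos|>
    placeholder per audio clip, followed by the text instruction.

    Hypothetical helper: only the single-clip token pattern is taken from
    the usage example; multi-clip repetition is an untested assumption.
    """
    return "<|audio_bos|><|AUDIO|><|audio_eos|>" * n_audio + instruction

# Reproduces the literal prompt string used in step 3.
prompt = build_audio_prompt("Generate the caption in English:")
```

The processor then replaces each <|AUDIO|> placeholder with the corresponding encoded audio features when text and audio are passed together.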

License

Qwen2-Audio is licensed under the Apache 2.0 License.
