Qwen2-Audio-7B
Introduction
Qwen2-Audio is a series of large audio-language models designed to process various audio inputs and generate textual responses. It supports two modes of interaction: voice chat without text input and audio analysis with text instructions. The series includes the Qwen2-Audio-7B pretrained model and the Qwen2-Audio-7B-Instruct chat model. For further information, refer to the Blog, GitHub, and Report.
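For the audio-analysis mode, the Instruct checkpoint is driven through a chat template in which a user turn can combine an audio clip with a text instruction. The snippet below is a minimal sketch of that flow, assuming the conversation format used by Qwen2-Audio in Transformers; the question text is only an example, and the audio URL is the sample clip reused in the guide below.
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")

# One user turn containing an audio clip plus a text instruction.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What is happening in this audio?"},
    ]},
]

# Render the conversation to a text prompt and collect the referenced audio clips.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [
    librosa.load(BytesIO(urlopen(item["audio_url"]).read()),
                 sr=processor.feature_extractor.sampling_rate)[0]
    for message in conversation if isinstance(message["content"], list)
    for item in message["content"] if item["type"] == "audio"
]

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]  # keep only the newly generated tokens
response = processor.batch_decode(generated_ids, skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]
print(response)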
Architecture
Qwen2-Audio models are integrated into the Hugging Face Transformers library. Architecturally, an audio encoder converts the input audio into a sequence of embeddings, and a Qwen language model consumes those embeddings together with the text tokens to generate the textual response. The models can be used in different interaction modes to support diverse use cases in audio analysis and voice communication.
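The checkpoint's configuration reflects this split and exposes the audio encoder and the language model as separate sub-configurations. A minimal sketch for inspecting them, assuming a Transformers release that ships the Qwen2-Audio classes:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-Audio-7B")
print(type(config).__name__)  # Qwen2AudioConfig
print(config.audio_config)    # audio encoder settings
print(config.text_config)     # language model settings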
Training
Specific training details are not covered here. Qwen2-Audio-7B is released as a pretrained checkpoint that maps audio inputs to text, and Qwen2-Audio-7B-Instruct is the corresponding chat model. Both can be downloaded and deployed directly for tasks involving audio analysis and text generation.
Guide: Running Locally
To run Qwen2-Audio locally:
- Install Dependencies: Ensure you have the latest Hugging Face Transformers by running:
pip install git+https://github.com/huggingface/transformers
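The snippets in the following steps also load audio with librosa and run the model with PyTorch; neither package is pulled in automatically by the command above, so install them as well if they are missing:
pip install librosa torch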
- Load Model and Processor:
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Load the pretrained base model and its processor (tokenizer + audio feature extractor).
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B", trust_remote_code=True)
- Prepare and Process Audio Input:
from io import BytesIO
from urllib.request import urlopen

import librosa

# The <|AUDIO|> placeholder marks where the encoded audio is inserted into the prompt.
prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/glass-breaking-151256.mp3"

# Download the clip and resample it to the rate the feature extractor expects.
audio, sr = librosa.load(BytesIO(urlopen(url).read()), sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=audio, return_tensors="pt")
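To sanity-check what the processor produced before generating, you can print the shapes of the returned tensors; the exact key names depend on the Transformers version, so this sketch simply iterates over whatever is present:
# inputs behaves like a dict of tensors: token ids for the text plus audio features.
print({name: tuple(tensor.shape) for name, tensor in inputs.items()})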
- Generate Text from Audio:
# Generate, then strip the prompt tokens so only the newly generated text remains.
generated_ids = model.generate(**inputs, max_length=256)
generated_ids = generated_ids[:, inputs.input_ids.size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# `response` now holds the generated caption for the audio clip.
- Consider Cloud GPUs: For better performance, especially with large models, consider using cloud-based GPUs from providers such as AWS, Google Cloud, or Azure; a sketch of loading the model directly onto a GPU follows below.
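When a GPU is available, locally or in the cloud, the model can be loaded in half precision and placed on the device at load time. This is a minimal sketch rather than an official recipe; it assumes a CUDA-capable machine and that the accelerate package is installed for device_map support.
import torch
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Load in float16 and let accelerate place the weights on the available GPU(s).
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Processed inputs must live on the same device as the model before calling generate(), e.g.:
# inputs = inputs.to(model.device)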
License
Qwen2-Audio is licensed under the Apache 2.0 License.