whisper large v3 turbo
openai
Introduction
Whisper is a model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. It was trained on over 5 million hours of labeled data and generalizes well across many datasets and domains. Whisper large-v3-turbo is a fine-tuned, optimized version of Whisper large-v3 with fewer decoder layers, which makes it considerably faster at a minor cost in quality.
Architecture
Whisper is a Transformer-based encoder-decoder, also known as a sequence-to-sequence model. The Whisper family includes both English-only and multilingual versions. The model either transcribes speech in the language of the source audio or translates it into English. Whisper large-v3-turbo has 809 million parameters and runs faster than large-v3 because it has fewer decoder layers.
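As a rough check on the 809 million figure, the parameter count of an encoder-decoder Transformer can be estimated from a handful of dimensions. The sketch below is a back-of-the-envelope estimate, not an official accounting. The dimensions (hidden size 1280, feed-forward size 5120, 32 encoder layers, 4 decoder layers, a 51,866-token vocabulary, 128 mel bins) are assumptions based on the publicly released model configuration rather than on the text above, and `WhisperDims` and `estimate_params` are hypothetical names. Biases, layer norms, and positional embeddings are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperDims:
    """Hypothetical container for the dimensions used in the estimate."""

    d_model: int = 1280      # hidden size (assumed)
    d_ffn: int = 5120        # feed-forward inner size (assumed)
    enc_layers: int = 32     # encoder layers (assumed)
    dec_layers: int = 4      # decoder layers (assumed for the turbo model)
    vocab_size: int = 51866  # token vocabulary size (assumed)
    n_mels: int = 128        # mel-spectrogram bins (assumed)


def estimate_params(dims: WhisperDims) -> int:
    """Approximate weight count, ignoring biases, layer norms, and positional embeddings."""
    attention = 4 * dims.d_model * dims.d_model      # query, key, value, output projections
    feed_forward = 2 * dims.d_model * dims.d_ffn     # two linear layers
    encoder_layer = attention + feed_forward         # self-attention + feed-forward
    decoder_layer = 2 * attention + feed_forward     # self- and cross-attention + feed-forward
    # Two width-3 1-D convolutions turn mel frames into encoder inputs.
    conv_stem = 3 * dims.n_mels * dims.d_model + 3 * dims.d_model * dims.d_model
    # Token embedding matrix, assumed to be shared with the output projection.
    embeddings = dims.vocab_size * dims.d_model
    return (
        dims.enc_layers * encoder_layer
        + dims.dec_layers * decoder_layer
        + conv_stem
        + embeddings
    )


turbo = estimate_params(WhisperDims())
full = estimate_params(WhisperDims(dec_layers=32))
print(f"4 decoder layers:  {turbo / 1e6:.0f}M parameters")
print(f"32 decoder layers: {full / 1e6:.0f}M parameters")
```

Under these assumptions the 4-layer estimate comes out at roughly 806 million, within about one percent of the stated 809 million, while the 32-layer variant is nearly twice as large. The encoder is identical in both cases, so the size reduction comes almost entirely from the decoder.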
Training
Whisper was trained with large-scale weak supervision on noisy data, which makes it more robust to accents, background noise, and technical language. It can also perform zero-shot translation from multiple languages into English. Its known limitations include hallucinations, lower accuracy on low-resource languages, and accuracy that varies across demographic groups.
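One common way to quantify accuracy gaps like those described above is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The text above does not name a metric, so using WER is an assumption here. `word_error_rate` is a hypothetical helper, and the transcripts are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # At the start of each outer iteration, previous[j] is the edit distance
    # between the reference words seen so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up transcripts, for illustration only.
reference = "the quick brown fox jumps"
hypothesis = "the quick brown box jumps high"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Computing this score separately for each language or speaker group, on transcripts you trust, is one way to see whether the model performs unevenly on your own data.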
Guide: Running Locally
- Installation:
  - Upgrade pip, then install the necessary libraries:
      pip install --upgrade pip
      pip install --upgrade transformers datasets[audio] accelerate
- Setup:
  - Load the model and processor:
      import torch
      from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

      # Use the GPU with half precision when available; fall back to float32 on CPU.
      device = "cuda:0" if torch.cuda.is_available() else "cpu"
      torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

      model_id = "openai/whisper-large-v3-turbo"
      model = AutoModelForSpeechSeq2Seq.from_pretrained(
          model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
      ).to(device)
      processor = AutoProcessor.from_pretrained(model_id)
- Transcribe Audio:
  - Use the pipeline to transcribe audio:
      pipe = pipeline(
          "automatic-speech-recognition",
          model=model,
          tokenizer=processor.tokenizer,
          feature_extractor=processor.feature_extractor,
          torch_dtype=model.dtype,  # keep input features in the model's precision
          device=device,
      )
      result = pipe("audio.mp3")
      print(result["text"])
- Cloud GPUs:
  - For faster inference, consider cloud-based GPUs such as those available through AWS, Google Cloud, or Azure.
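The guide above only transcribes. Because the model can also translate speech into English, the same pipeline can be switched between the two tasks through its `generate_kwargs` argument. The sketch below is a minimal illustration, assuming a short audio clip and a recent `transformers` release. `build_generate_kwargs` and `run_whisper` are hypothetical helpers, not part of the library.

```python
from typing import Optional

VALID_TASKS = ("transcribe", "translate")


def build_generate_kwargs(task: str = "transcribe", language: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble the generation options passed to the pipeline."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    kwargs = {"task": task}
    if language is not None:
        # Spoken language of the audio, e.g. "french"; omit it to let Whisper detect it.
        kwargs["language"] = language
    return kwargs


def run_whisper(audio_path: str, task: str = "transcribe", language: Optional[str] = None) -> str:
    """Hypothetical wrapper: build a default pipeline and run one audio file through it."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
    result = pipe(audio_path, generate_kwargs=build_generate_kwargs(task, language))
    return result["text"]


# Inspect the options that would be sent for a French-to-English translation.
print(build_generate_kwargs("translate", language="french"))
```

Calling `run_whisper("audio.mp3", task="translate")` would then return English text for non-English speech, while the default `task="transcribe"` keeps the output in the source language.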
License
The Whisper model is released under the MIT License, which permits broad use and modification provided the copyright and license notice are retained.