distil large v3
distil-whisper
Introduction
Distil-Whisper is a distilled version of OpenAI's Whisper model, designed for efficient automatic speech recognition (ASR). distil-large-v3 is the final installment in the Distil-Whisper English series and is optimized for long-form transcription and faster inference. On long-form audio, its word error rate (WER) is within 1% of Whisper large-v3, and it outperforms the earlier Distil-Whisper models.
Architecture
Distil-Whisper uses an encoder-decoder architecture: the encoder maps speech inputs to a sequence of hidden states, and the decoder predicts text tokens from them. Because the decoder runs once per generated token, distillation concentrates on shrinking the decoder to reduce latency. The encoder is copied from the teacher model and kept frozen during training, while the student decoder uses a subset of the teacher's decoder layers, initialized from maximally spaced teacher layers.
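One way to picture "maximally spaced" initialization is to spread the student's decoder layers evenly across the teacher's decoder depth. The sketch below is only an illustration of that idea: the function name select_teacher_layers, the rounding rule, and the layer counts in the usage lines are assumptions, not details taken from the Distil-Whisper training code.

```python
def select_teacher_layers(num_teacher_layers: int, num_student_layers: int) -> list[int]:
    """Return 0-based teacher layer indices spread evenly from first to last layer.

    Hypothetical helper: the exact selection rule used for Distil-Whisper
    is not stated in this document.
    """
    if not 1 <= num_student_layers <= num_teacher_layers:
        raise ValueError("student layers must be between 1 and the teacher layer count")
    if num_student_layers == 1:
        # With a single student layer there is no spacing to compute; take the first layer.
        return [0]
    # Evenly spaced positions between the first and last teacher layer.
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]


# Illustrative layer counts (assumed): a 32-layer teacher decoder.
two_layer_student = select_teacher_layers(32, 2)
four_layer_student = select_teacher_layers(32, 4)
```

With these assumed counts, a two-layer student would be initialized from the first and last teacher decoder layers, which keeps the copied layers as far apart as possible.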
Training
The model is trained on 22,000 hours of audio from diverse domains, using pseudo-labels generated by Whisper large-v3 as training targets. A word error rate (WER) filter discards examples whose pseudo-label appears to be mis-transcribed, so that the student trains only on accurate targets. Training runs for 80,000 optimization steps (11 epochs) with a batch size of 256.
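The WER filter can be sketched as follows, assuming each training example has a ground-truth transcript to compare against the Whisper pseudo-label. The normalizer, the 0.10 threshold, and the helper names are assumptions for illustration; the document does not give the actual values or implementation.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words (simplified normalizer)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


WER_THRESHOLD = 0.10  # assumed value, for illustration only


def keep_example(ground_truth: str, pseudo_label: str, threshold: float = WER_THRESHOLD) -> bool:
    """Keep a training example only if its pseudo-label stays close to the ground truth."""
    return word_error_rate(ground_truth, pseudo_label) <= threshold
```

For example, keep_example("hello world", "goodbye world") would reject the pair, since one of the two reference words differs and the resulting WER of 0.5 exceeds the assumed threshold.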
Guide: Running Locally
- Install Dependencies:
    pip install --upgrade pip
    pip install --upgrade transformers accelerate datasets[audio]
- Set Up Model and Processor:
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    # Use the first GPU with half precision when available; otherwise CPU with full precision.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "distil-whisper/distil-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
    )
    model.to(device)
- Transcribe Audio:
    from datasets import load_dataset

    # Reuses model, model_id, and device from the previous step.
    processor = AutoProcessor.from_pretrained(model_id)
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        device=device,
    )

    # Load a small LibriSpeech sample and transcribe it.
    dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = dataset[0]["audio"]
    result = pipe(sample)
    print(result["text"])
- Use Cloud GPUs: For faster inference, consider running the model on a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
License
Distil-Whisper is licensed under the MIT License, which it inherits from OpenAI's Whisper model.