whisper-base
openai

Introduction
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Developed by OpenAI, it is based on a Transformer encoder-decoder architecture and trained on 680,000 hours of labeled audio data. Whisper generalizes well across many datasets and domains without fine-tuning, and supports both multilingual speech recognition and speech-to-English translation.
Architecture
Whisper employs a Transformer-based encoder-decoder setup, commonly known as a sequence-to-sequence model. It is available in five configurations with varying model sizes, ranging from 39M to 1550M parameters. Models are trained on either English-only or multilingual data, with the largest models being exclusively multilingual. The architecture allows for flexibility in handling both transcription and translation tasks by using context tokens to inform the model of the desired output.
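To illustrate these context tokens: the decoder's prompt starts with a start-of-transcript token, followed by a language token and a task token, and optionally a token that disables timestamps. The sketch below assembles that sequence as plain strings; the token names match Whisper's special-token vocabulary, but the helper function itself is a hypothetical illustration, not part of any library API:

```python
def build_context_tokens(language="en", task="transcribe", timestamps=False):
    """Sketch of the context-token prefix Whisper's decoder is conditioned on."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token, the model interleaves timestamp tokens in its output.
        tokens.append("<|notimestamps|>")
    return tokens

# e.g. translating French speech into English text:
print(build_context_tokens(language="fr", task="translate"))
```

Swapping the language and task tokens is all it takes to switch the same checkpoint between transcription and translation.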
Training
The Whisper model was trained using 680,000 hours of audio data collected from the internet, covering 99 languages. The data is split into 65% English audio and transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts. This large-scale weakly supervised training enables Whisper to perform robustly across different languages and tasks, although performance varies depending on the language's data availability and resource level.
Guide: Running Locally
- Install Dependencies: Ensure you have the transformers and datasets libraries installed via pip.

```
pip install transformers datasets
```
- Load Model and Processor: Use the Hugging Face transformers library to load the Whisper model and processor.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
```
- Process Audio Input: Pre-process audio with the processor, which converts raw waveforms to log-Mel spectrograms.

```python
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
```
- Generate Transcription: Use the model to generate token IDs, then decode them to text.

```python
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```
- Use Cloud GPUs: For faster inference, especially with the larger checkpoints, consider using cloud-based GPUs through platforms such as AWS, Google Cloud, or Azure.
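To use such a GPU, the model and inputs from the steps above need to be moved onto it before calling generate. A minimal sketch of the device selection, assuming PyTorch is installed:

```python
import torch

# Use a CUDA GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# With the model and inputs from the steps above, generation would then run as:
#   model.to(device)
#   predicted_ids = model.generate(input_features.to(device))
print(device)
```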
License
Whisper is released under the Apache 2.0 license, permitting free use, distribution, and modification, provided that the license terms are met.