whisper-medium.en
by openai

Introduction
Whisper is a pre-trained model developed by OpenAI for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled speech data, it generalizes to a wide range of datasets and domains without requiring fine-tuning. The model was introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision". This checkpoint, whisper-medium.en, is the English-only, medium-sized variant.
Architecture
Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It comes in five sizes (tiny, base, small, medium, and large); the four smallest are available as both English-only and multilingual checkpoints, while the largest is multilingual only. Multilingual checkpoints can perform both speech recognition and speech translation, whereas English-only checkpoints such as this one perform transcription. All pre-trained checkpoints are available on the Hugging Face Hub.
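The encoder-decoder layout can be inspected directly through the configuration class in transformers. The sketch below uses WhisperConfig's defaults, which mirror the tiny checkpoint rather than medium.en, so no weights need to be downloaded; it only illustrates the structure described above:

```python
# Inspect Whisper's sequence-to-sequence layout via its config class.
# Note: default WhisperConfig hyperparameters match the tiny model,
# not medium.en; this only illustrates the encoder-decoder structure.
from transformers import WhisperConfig

config = WhisperConfig()

print(config.is_encoder_decoder)                     # True: seq2seq model
print(config.encoder_layers, config.decoder_layers)  # symmetric encoder/decoder stacks
print(config.num_mel_bins)                           # log-Mel spectrogram input channels
```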
Training
The Whisper model was trained using 680,000 hours of labeled speech data obtained from the internet. This dataset includes both English and non-English audio, with transcripts available in the corresponding languages. The training process involves large-scale weak supervision, which allows the model to generalize well across different domains and languages.
Guide: Running Locally
To run Whisper locally for English speech recognition, follow these steps:
1. Install the required libraries:

       pip install transformers datasets torch
2. Load the model and processor:

       from transformers import WhisperProcessor, WhisperForConditionalGeneration

       processor = WhisperProcessor.from_pretrained("openai/whisper-medium.en")
       model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")
3. Load and process the data:

       from datasets import load_dataset

       ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
       sample = ds[0]["audio"]
       input_features = processor(
           sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
       ).input_features
4. Generate and decode the transcription:

       predicted_ids = model.generate(input_features)
       transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
For improved performance, especially for large models, consider using cloud GPUs such as AWS EC2 instances with GPU support, Google Cloud's GPU offerings, or Azure's GPU-powered virtual machines.
License
Whisper is released under the Apache-2.0 license, which permits personal and commercial use, modification, and distribution, provided the license's notice and attribution requirements are met.