Whisper Large (openai/whisper-large)

Introduction
Whisper is a pre-trained automatic speech recognition (ASR) and speech translation model developed by OpenAI. It is trained on 680,000 hours of labeled data and is capable of generalizing across various datasets and domains without fine-tuning. A subsequent large-v2 model was introduced, trained for 2.5 times more epochs with regularization, surpassing the performance of the original model.
Architecture
Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model, trained with large-scale weak supervision. It supports both English-only and multilingual datasets for speech recognition and translation tasks. The model comes in different configurations based on size, with checkpoints available for various model sizes, including tiny, base, small, medium, and large.
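The checkpoints follow a predictable naming scheme on the Hugging Face Hub: `openai/whisper-<size>`, with English-only variants carrying an `.en` suffix for every size except large. A small sketch of that mapping (`checkpoint_id` is a hypothetical helper, not part of any library):

```python
SIZES = ["tiny", "base", "small", "medium", "large"]

def checkpoint_id(size: str, english_only: bool = False) -> str:
    """Return the Hugging Face Hub id for a Whisper checkpoint size."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    # English-only variants (e.g. "openai/whisper-tiny.en") exist for all
    # sizes except large, which is multilingual only.
    if english_only and size == "large":
        raise ValueError("no English-only variant of the large checkpoint")
    suffix = ".en" if english_only else ""
    return f"openai/whisper-{size}{suffix}"
```

Any of these ids can be passed to `from_pretrained` in the guide below.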
Training
The Whisper model was trained on 680,000 hours of audio data, with 65% representing English-language audio. The multilingual model was also trained for speech translation tasks. The model is capable of improving its capabilities through fine-tuning, especially for specific languages and tasks.
Guide: Running Locally
- Installation: Install the required libraries, including `transformers`, `datasets`, and `evaluate` (used in the evaluation step below), using pip:

  ```bash
  pip install transformers datasets evaluate
  ```
- Model and Processor: Load the Whisper model and processor:

  ```python
  from transformers import WhisperProcessor, WhisperForConditionalGeneration

  processor = WhisperProcessor.from_pretrained("openai/whisper-large")
  model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
  ```
- Load Data: Use the `datasets` library to load your audio data:

  ```python
  from datasets import load_dataset

  ds = load_dataset("librispeech_asr", "clean", split="test")
  ```
- Process and Transcribe: Pre-process the audio and generate transcriptions:

  ```python
  sample = ds[0]["audio"]
  input_features = processor(
      sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
  ).input_features
  predicted_ids = model.generate(input_features)
  transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
  ```
- Evaluation: Calculate the Word Error Rate (WER) using the `evaluate` library. Note that `batch_decode` already returns a list of strings, so it can be passed to `predictions` directly:

  ```python
  from evaluate import load

  wer = load("wer")
  # reference_text is the ground-truth transcript for the sample
  result = wer.compute(references=[reference_text], predictions=transcription)
  ```
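For intuition about what the metric measures: WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal pure-Python sketch of the same computation, independent of the `evaluate` library:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference yields a WER of 1/3. The `evaluate` implementation additionally handles batching and edge cases, so prefer it for real benchmarks.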
Cloud GPUs: For better performance, especially with large models, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
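Using a GPU only requires selecting a device and moving the model and inputs to it; a minimal sketch using PyTorch (which `transformers` already depends on):

```python
import torch

# On a cloud GPU instance (AWS, Google Cloud, Azure) this selects CUDA
# automatically; on a CPU-only machine it falls back to "cpu".
device = "cuda" if torch.cuda.is_available() else "cpu"

# The model and inputs from the guide above would then be moved with:
# model.to(device)
# input_features = input_features.to(device)
```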
License
Whisper is released under the Apache 2.0 license, allowing for both academic and commercial use.