Whisper Large v2
Introduction
Whisper is a pre-trained model by OpenAI for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled speech data, it performs robustly across many languages and domains and is designed to generalize without fine-tuning. The large-v2 checkpoint was trained for additional epochs with added regularization, yielding improved results over the original large model.
Architecture
Whisper uses a Transformer-based encoder-decoder (sequence-to-sequence) architecture. Checkpoints are trained on either English-only or multilingual data for speech recognition and translation, and range in size from tiny to large-v2, the multilingual variant described here. Input audio is converted to a log-Mel spectrogram and passed through the encoder; the decoder is then conditioned on a sequence of context tokens that specify the language and task (transcription or translation).
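The spectrogram front end described above can be sketched with the Transformers feature extractor, whose default configuration matches Whisper's published setup (16 kHz input, 80 mel bins, 30-second windows):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Default configuration matches Whisper's published setup for large-v2:
# 16 kHz sampling rate, 80 mel bins, 30-second analysis windows.
feature_extractor = WhisperFeatureExtractor()

# One second of silence as a stand-in for real speech.
audio = np.zeros(16000, dtype=np.float32)

# Audio is padded/trimmed to 30 s and converted to a log-Mel spectrogram.
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)  # (1, 80, 3000): 80 mel bins, 3000 frames
```

The encoder consumes this fixed-size spectrogram, so short clips are padded and longer audio is chunked into 30-second windows.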
Training
The Whisper models are trained using 680,000 hours of labeled speech data, collected from various sources on the internet. The training data consists of 65% English, 18% non-English with English translations, and 17% non-English with corresponding transcripts, covering 98 languages. This extensive dataset enables the model to perform well across languages, although there are noted limitations in low-resource languages and specific dialects.
Guide: Running Locally
- Setup: Install the necessary libraries, such as Transformers and Datasets from Hugging Face.
- Model and Processor Loading: Use the `WhisperProcessor` and `WhisperForConditionalGeneration` classes to load the processor and model.
- Dataset Preparation: Load an audio dataset for transcription, such as the LibriSpeech dataset.
- Transcription: Process the audio input and generate transcriptions using the model.
- Translation: Set up the translation task if needed, adjusting the context tokens accordingly.
- Evaluation: Use metrics such as word error rate (WER) to assess transcription accuracy.
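Assuming the `transformers` library is installed, the loading, transcription, and translation steps above can be sketched as follows (shown with the small `openai/whisper-tiny` checkpoint for speed; substitute `openai/whisper-large-v2` for the model this card describes):

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Small checkpoint for a quick sketch; swap in "openai/whisper-large-v2".
model_id = "openai/whisper-tiny"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Stand-in audio: one second of silence at 16 kHz. In practice, load a
# sample (e.g. from LibriSpeech) with the datasets library instead.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Transcription: context tokens steer the decoder to English transcription.
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)

# Translation: switching the task token makes the model translate speech
# in another language (here French) into English text.
translate_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
```

The same `generate` call handles both tasks; only the context tokens change.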
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
The Whisper model is released under the Apache-2.0 License, permitting broad use, modification, and distribution provided the license and copyright notices are retained.