whisper base.en

openai

Introduction

Whisper is a pre-trained model developed by OpenAI for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes well across datasets and domains without any fine-tuning. Whisper checkpoints come in English-only and multilingual variants; base.en, the checkpoint described here, is English-only. For speech recognition, the model transcribes audio in the same language it was spoken in, while for speech translation, it transcribes into a language different from the audio.

Architecture

Whisper is a Transformer-based encoder-decoder, i.e. a sequence-to-sequence model. It comes in five configurations of varying sizes: the four smallest are released in both English-only and multilingual versions, while the largest checkpoints are multilingual only. The model performs well on both ASR and speech translation, and all pre-trained checkpoints are available on the Hugging Face Hub.
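
As a rough illustration of how the configurations map to checkpoint names on the Hub (the identifiers below are the published ones; this card's model is the English-only base.en):

    from transformers import WhisperForConditionalGeneration

    # English-only variants carry a ".en" suffix; multilingual variants omit it.
    model_en = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")  # English-only
    model_multi = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")  # multilingual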

Training

The Whisper models are trained on 680,000 hours of audio data sourced from the internet. Roughly 65% of this data is English-language audio with matched English transcripts, 18% is non-English audio with English transcripts (used for the translation task), and the remaining 17% is non-English audio with transcripts in the corresponding language. Performance in a given language correlates with the amount of training data available for that language.

Guide: Running Locally

To run Whisper locally, follow these steps:

  1. Install dependencies: Ensure you have Python installed, then install the transformers, datasets, and evaluate libraries.
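
    For example, a typical setup (versions unpinned; PyTorch is also required by the model code below and can be installed separately):

    pip install --upgrade transformers datasets evaluate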

  2. Load the model and processor:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # The processor bundles the feature extractor (audio -> log-Mel features) and the tokenizer.
    processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")
    
  3. Prepare audio input: Load your audio data and preprocess it using the Whisper processor.
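
    For example, a minimal sketch using the datasets library (the dummy LibriSpeech split is used purely for illustration; any 16 kHz mono recording works):

    from datasets import load_dataset

    # Load one short English clip; LibriSpeech audio is already sampled at 16 kHz, which Whisper expects.
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]
    audio_data = sample["array"]
    sampling_rate = sample["sampling_rate"]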

  4. Generate transcription:

    # Convert the raw waveform to log-Mel input features; Whisper expects 16 kHz audio.
    input_features = processor(audio_data, sampling_rate=sampling_rate, return_tensors="pt").input_features
    # Autoregressively generate token IDs, then decode them to text.
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    
  5. Evaluate performance: Use a dataset like LibriSpeech for evaluation, and compute metrics such as Word Error Rate (WER).
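
    A minimal sketch with the evaluate library (reference_text stands in for the ground-truth transcript of the clip and is assumed to exist):

    from evaluate import load

    # Word Error Rate (WER): lower is better, 0 means a perfect transcription.
    wer_metric = load("wer")
    wer = 100 * wer_metric.compute(references=[reference_text], predictions=[transcription[0]])
    print(f"WER: {wer:.2f}%")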

Cloud GPUs: For faster processing, consider using cloud services such as AWS, Google Cloud, or Azure, which offer GPU instances.

License

The Whisper model is released under the Apache 2.0 license, allowing for wide usage and distribution.
