Whisper medium.en

Introduction

Whisper is a pre-trained model developed by OpenAI for automatic speech recognition (ASR) and speech translation. Introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" and trained on 680,000 hours of labeled speech data, it generalizes to a wide range of datasets and domains without requiring fine-tuning. This checkpoint, medium.en, is the English-only version of the medium-sized model.

Architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It comes in five configurations of varying size, from tiny to large; the four smallest are available in both English-only and multilingual variants. English-only checkpoints such as medium.en perform speech recognition, while multilingual checkpoints are additionally trained for speech translation. All pre-trained checkpoints are available on the Hugging Face Hub.
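
The encoder-decoder layout of a given checkpoint can be inspected from its configuration without downloading the weights. A minimal sketch using transformers' WhisperConfig (the attribute names encoder_layers, decoder_layers, and d_model come from that class):

    from transformers import WhisperConfig
    
    # Fetch only the configuration of the medium.en checkpoint
    config = WhisperConfig.from_pretrained("openai/whisper-medium.en")
    
    # Sequence-to-sequence layout of the Transformer
    print(config.encoder_layers, config.decoder_layers, config.d_model)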

Training

The Whisper model was trained on 680,000 hours of labeled speech data collected from the internet. The dataset includes both English and non-English audio, with transcripts in the corresponding languages. Because the labels are large-scale but weakly supervised (web-sourced transcripts rather than hand-verified annotations), the model generalizes well across domains and languages.

Guide: Running Locally

To run Whisper locally for English speech recognition, follow these steps (a condensed pipeline-based alternative is sketched after the list):

  1. Install Required Libraries (soundfile is used by datasets to decode the audio files):

    pip install transformers datasets torch soundfile
    
  2. Load Model and Processor:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    
    # The processor bundles the feature extractor (audio -> log-Mel spectrogram)
    # and the tokenizer (token ids <-> text)
    processor = WhisperProcessor.from_pretrained("openai/whisper-medium.en")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")
    
  3. Load and Process Data:

    from datasets import load_dataset
    
    # Load a small dummy LibriSpeech split; the audio column holds a decoded 16 kHz waveform
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]
    # Convert the raw waveform into the log-Mel spectrogram the model expects
    input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
    
  4. Generate and Decode Transcriptions:

    # Autoregressively generate token ids, then decode them back to text
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    print(transcription[0])
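
As an alternative to steps 2-4, the high-level pipeline API wraps feature extraction, generation, and decoding in a single call. A minimal sketch, assuming an audio file on disk (the path audio.wav is a placeholder); the chunk_length_s argument lets the pipeline handle audio longer than Whisper's 30-second window:

    from transformers import pipeline
    
    # The ASR pipeline bundles the processor and model behind one call
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium.en", chunk_length_s=30)
    
    # "audio.wav" is a placeholder path; a numpy array or dataset sample also works
    result = asr("audio.wav")
    print(result["text"])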

For improved performance, especially for large models, consider using cloud GPUs such as AWS EC2 instances with GPU support, Google Cloud's GPU offerings, or Azure's GPU-powered virtual machines.
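
On a GPU machine, inference can be sped up by moving the model to the device and loading the weights in half precision. A minimal sketch, assuming CUDA is available and a transformers version that supports the torch_dtype argument to from_pretrained:

    import torch
    from transformers import WhisperForConditionalGeneration
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    
    # Loading in float16 roughly halves memory use and speeds up GPU inference
    model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-medium.en", torch_dtype=dtype
    ).to(device)
    
    # The input features from step 3 must match the model's device and dtype:
    # input_features = input_features.to(device, dtype=dtype)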

License

Whisper is released under the Apache-2.0 license, which permits personal and commercial use, modification, and distribution, provided that the license text and copyright notices are retained.
