Whisper Base

openai

Introduction

Whisper is a pre-trained model designed for automatic speech recognition (ASR) and speech translation. Developed by OpenAI, it is based on a Transformer encoder-decoder architecture and trained on 680k hours of labeled audio data. Whisper generalizes across a variety of datasets and domains without needing fine-tuning. It is multilingual, supporting both speech recognition and translation of speech into English.
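
As a quick illustration of these capabilities, the snippet below is a minimal sketch using the Hugging Face transformers pipeline API; the file name audio.wav is a placeholder for a local recording, not something referenced by this model card.

    from transformers import pipeline

    # Load the base checkpoint behind the high-level ASR pipeline.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

    # "audio.wav" is a placeholder path to a local recording.
    print(asr("audio.wav")["text"])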

Architecture

Whisper employs a Transformer-based encoder-decoder setup, commonly known as a sequence-to-sequence model. It is available in five configurations, ranging from 39M to 1550M parameters. Checkpoints are trained on either English-only or multilingual data; the smaller sizes come in both variants, while the largest checkpoints are multilingual only. The architecture handles both transcription and translation with a single model by using context tokens to tell the decoder which output is desired.
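
To make the role of these context tokens concrete, the sketch below uses the Hugging Face processor to build the decoder prompt for a chosen language and task; the choice of French audio and the translate task is only an example.

    from transformers import WhisperProcessor

    processor = WhisperProcessor.from_pretrained("openai/whisper-base")

    # Context tokens tell the decoder the source language and the desired task:
    # task="transcribe" keeps the source language, task="translate" outputs English.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    print(forced_decoder_ids)

    # These ids can be passed to model.generate(..., forced_decoder_ids=forced_decoder_ids)
    # so that a single checkpoint switches between transcription and translation.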

Training

The Whisper model was trained using 680,000 hours of audio data collected from the internet, covering 99 languages. The data is split into 65% English audio and transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts. This large-scale weakly supervised training enables Whisper to perform robustly across different languages and tasks, although performance varies depending on the language's data availability and resource level.

Guide: Running Locally

  1. Install Dependencies: Ensure you have the transformers and datasets libraries installed via pip; PyTorch is also required as the backend for the model.

    pip install torch transformers datasets
    
  2. Load Model and Processor: Use the Hugging Face library to load the Whisper model and processor.

    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    processor = WhisperProcessor.from_pretrained("openai/whisper-base")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
    
  3. Process Audio Input: Pre-process audio inputs using the processor to convert them to log-Mel spectrograms.

    from datasets import load_dataset
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]
    input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
    
  4. Generate Transcription: Use the model to generate token IDs, then decode them to text.

    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    
  5. Use Cloud GPUs: For optimal performance, especially with the larger checkpoints, consider running inference on a GPU, for example through cloud platforms such as AWS, Google Cloud, or Azure (see the sketch after this list).
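
The steps above run on CPU by default. The sketch below assumes PyTorch with CUDA is available (locally or on a cloud instance) and repeats the same transcription flow with the model and input features moved onto the GPU.

    import torch
    from datasets import load_dataset
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = WhisperProcessor.from_pretrained("openai/whisper-base")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").to(device)

    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]
    input_features = processor(
        sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
    ).input_features.to(device)

    predicted_ids = model.generate(input_features)
    print(processor.batch_decode(predicted_ids, skip_special_tokens=True))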

License

Whisper is released under the Apache 2.0 license, permitting free use, distribution, and modification, provided that the license terms are met.
