Whisper Large

openai

Introduction

Whisper is a pre-trained automatic speech recognition (ASR) and speech translation model developed by OpenAI. It was trained on 680,000 hours of labeled data and generalizes to many datasets and domains without fine-tuning. A subsequent large-v2 model was trained for 2.5 times more epochs with added regularization and surpasses the performance of the original large model.

Architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model, trained with large-scale weak supervision. Checkpoints are available in five sizes: tiny, base, small, medium, and large. The four smallest sizes come in both English-only and multilingual variants, while the large checkpoint is multilingual only; the multilingual models cover both speech recognition and speech translation.
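
For illustration, the checkpoint sizes can be compared through their configurations alone, without downloading any weights; here is a minimal sketch using WhisperConfig from transformers (the three model ids below are the public Hugging Face checkpoints):

    from transformers import WhisperConfig
    
    # Fetch only the configuration files and compare encoder-decoder dimensions.
    for name in ["openai/whisper-tiny", "openai/whisper-base", "openai/whisper-large"]:
        cfg = WhisperConfig.from_pretrained(name)
        print(name, "-", cfg.encoder_layers, "encoder layers,",
              cfg.decoder_layers, "decoder layers, d_model =", cfg.d_model)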

Training

The Whisper model was trained on 680,000 hours of audio data, 65% of which is English-language audio. The multilingual model was additionally trained on speech translation, mapping non-English speech to English text. Performance on specific languages and tasks can be further improved through fine-tuning; language and task can also be selected at inference time, as sketched below.
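
As a minimal sketch of that inference-time control, the multilingual checkpoint selects its language and task through special decoder prompt tokens, obtained via the processor's get_decoder_prompt_ids helper:

    from transformers import WhisperProcessor
    
    processor = WhisperProcessor.from_pretrained("openai/whisper-large")
    
    # Select the source language and task, e.g. translate French speech
    # directly into English text:
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    # pass forced_decoder_ids=forced_decoder_ids to model.generate(...)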

Guide: Running Locally

  1. Installation: Install the required libraries, including transformers, datasets, and evaluate (used for scoring in step 5), using pip:

    pip install transformers datasets evaluate
    
  2. Model and Processor: Load the Whisper model and processor:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    
    # The processor bundles the feature extractor (audio -> log-Mel spectrogram)
    # and the tokenizer (token ids -> text).
    processor = WhisperProcessor.from_pretrained("openai/whisper-large")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
    
  3. Load Data: Use the datasets library to load your audio data:

    from datasets import load_dataset
    
    # LibriSpeech test-clean (audio decoding uses the soundfile backend)
    ds = load_dataset("librispeech_asr", "clean", split="test")
    
  4. Process and Transcribe: Pre-process the audio and generate transcriptions (a one-call pipeline alternative is sketched after this list):

    sample = ds[0]["audio"]
    # Convert the raw waveform into log-Mel input features at 16 kHz.
    input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
    predicted_ids = model.generate(input_features)
    # batch_decode returns a list with one string per generated sequence.
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    
  5. Evaluation: For evaluation, calculate the Word Error Rate (WER) using the evaluate library:

    from evaluate import load
    
    wer = load("wer")
    # LibriSpeech stores the ground-truth transcript in the "text" field.
    reference_text = ds[0]["text"]
    # transcription is already a list of strings, so pass it directly; for a
    # meaningful WER, normalize case and punctuation on both sides first.
    result = wer.compute(references=[reference_text], predictions=transcription)
    

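As an alternative to steps 2-4, the high-level pipeline API in transformers bundles feature extraction, generation, and decoding into a single call; here is a minimal sketch (reusing sample from step 4):

    from transformers import pipeline
    
    # chunk_length_s=30 lets the pipeline transcribe audio longer than
    # Whisper's 30-second input window by chunking it.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large", chunk_length_s=30)
    result = asr({"raw": sample["array"], "sampling_rate": sample["sampling_rate"]})
    print(result["text"])
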
Cloud GPUs: For better performance, especially with large models, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
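
As a minimal sketch, the model and input features from step 4 can be moved onto a GPU when one is available:

    import torch
    
    # Run generation on CUDA if available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    input_features = input_features.to(device)
    predicted_ids = model.generate(input_features)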

License

Whisper is released under the Apache 2.0 license, allowing for both academic and commercial use.