Introduction

Distil-Whisper is a distilled, smaller, and faster variant of OpenAI's Whisper model, designed for efficient automatic speech recognition (ASR). This checkpoint, distil-large-v3, is the final installment in the Distil-Whisper English series, optimized for long-form transcription and faster inference. It achieves a word error rate (WER) within 1% of Whisper large-v3 on long-form audio, outperforming its Distil-Whisper predecessors.

Architecture

Distil-Whisper employs an encoder-decoder architecture: the encoder maps speech inputs to hidden states, and the decoder autoregressively predicts text tokens. Because the decoder dominates inference latency, distillation focuses on shrinking it. The encoder is copied in full from the teacher model and kept frozen during training, while the decoder is reduced to a small subset of the teacher's decoder layers, initialized from maximally spaced layers (the first and last).
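
To see this encoder-decoder asymmetry concretely, you can inspect the checkpoint's configuration; the following is a minimal sketch using the transformers AutoConfig API (only the configuration file is fetched, not the weights):

    from transformers import AutoConfig
    
    # Load just the model configuration for the distilled checkpoint.
    config = AutoConfig.from_pretrained("distil-whisper/distil-large-v3")
    
    # The encoder keeps the teacher's full depth, while the distilled
    # decoder retains only a small subset of the teacher's layers.
    print("encoder layers:", config.encoder_layers)
    print("decoder layers:", config.decoder_layers)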

Training

The model is trained on 22,000 hours of audio data spanning diverse domains, using pseudo-labels generated by Whisper large-v3. A word error rate (WER) filter is applied during training: examples whose pseudo-labels deviate too far from the ground-truth transcripts are discarded, so the student learns only from accurate targets. Training runs for 80,000 optimization steps (roughly 11 epochs) with a batch size of 256.
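
The filtering step can be sketched as follows. This is an illustrative example rather than the official training code: the jiwer-based helper and the exact threshold value are assumptions.

    import jiwer  # third-party WER library: pip install jiwer
    
    # Assumed cutoff for illustration; the training recipe filters on a fixed WER threshold.
    WER_THRESHOLD = 0.10
    
    def keep_example(ground_truth: str, pseudo_label: str) -> bool:
        # Keep the example only if the pseudo-label agrees closely
        # enough with the ground-truth transcript.
        return jiwer.wer(ground_truth, pseudo_label) <= WER_THRESHOLD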

Guide: Running Locally

  1. Install Dependencies:

    pip install --upgrade pip
    pip install --upgrade transformers accelerate datasets[audio]
    
  2. Set Up Model and Processor:

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
    
    # Use the GPU and half precision when available; otherwise fall back to CPU and float32.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    
    model_id = "distil-whisper/distil-large-v3"
    
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
    )
    model.to(device)
    
  3. Transcribe Audio (local files and long-form audio are covered in the note after this guide):

    from datasets import load_dataset
    
    processor = AutoProcessor.from_pretrained(model_id)
    
    # Build an ASR pipeline from the model, tokenizer, and feature extractor.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )
    
    # Load a short validation sample and transcribe it.
    dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = dataset[0]["audio"]
    result = pipe(sample)
    print(result["text"])
    
  4. Use Cloud GPUs: For enhanced performance, consider using cloud GPUs like those offered by AWS, Google Cloud, or Azure.
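
As noted in step 3, the pipeline is not limited to dataset samples: you can pass a path to a local audio file directly, and for long-form audio you can enable chunked inference. A minimal sketch, reusing the objects from steps 2 and 3 (the file name is hypothetical):

    # Transcribe a local audio file (hypothetical path).
    result = pipe("audio.mp3")
    print(result["text"])
    
    # For long-form audio, rebuild the pipeline with chunked inference:
    # the input is split into 25-second chunks that are transcribed in batches.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=25,
        batch_size=16,
        torch_dtype=torch_dtype,
        device=device,
    )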

License

Distil-Whisper is licensed under the MIT License, inherited from OpenAI's Whisper model.
