Whisper Large-v3

Introduction

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. Trained on more than 5 million hours of labeled data, it generalizes to many datasets and domains in a zero-shot setting, without the need for fine-tuning.

Architecture

Whisper large-v3 retains the architecture of its predecessors but introduces two changes: spectrogram inputs use 128 Mel frequency bins instead of 80, and a new language token is added for Cantonese. It was trained on a mixture of 1 million hours of weakly labeled audio and 4 million hours of audio pseudo-labeled by Whisper large-v2.
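
As an illustration of the first change, the snippet below (a minimal sketch using the transformers feature extractor for this checkpoint) shows that the log-Mel spectrogram input now has 128 bins:

    import numpy as np
    from transformers import WhisperFeatureExtractor
    
    # Feature extractor bundled with the large-v3 checkpoint
    feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
    
    # 30 seconds of silence at 16 kHz, Whisper's expected sampling rate
    audio = np.zeros(16000 * 30, dtype=np.float32)
    
    features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
    print(features.input_features.shape)  # (1, 128, 3000): 128 Mel bins x 3000 frames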

Training

The Whisper large-v3 model was trained for 2.0 epochs over this mixture dataset and achieves a 10-20% reduction in error rates compared to Whisper large-v2. It performs robustly across a wide range of languages and is particularly effective in zero-shot settings.

Guide: Running Locally

  1. Install Necessary Libraries

    pip install --upgrade pip
    pip install --upgrade transformers datasets[audio] accelerate
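
    To confirm the libraries are importable (an optional sanity check):
    
    import transformers, datasets
    print(transformers.__version__, datasets.__version__)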
    
  2. Load and Run the Model

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
    from datasets import load_dataset
    
    # Run on GPU in half precision when available; otherwise CPU in full precision
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    
    model_id = "openai/whisper-large-v3"
    
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)
    
    # The processor bundles the tokenizer and the log-Mel feature extractor
    processor = AutoProcessor.from_pretrained(model_id)
    
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=torch_dtype,
        device=device,
    )
    
    # Load a sample audio clip from a public speech dataset
    dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
    sample = dataset[0]["audio"]
    
    result = pipe(sample)
    print(result["text"])
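
    By default, Whisper detects the source language and transcribes it. The language and task can be set explicitly, and timestamps requested, through the same pipeline call:
    
    # Pin the language and task instead of relying on auto-detection
    result = pipe(sample, generate_kwargs={"language": "english", "task": "transcribe"})
    
    # Sentence-level timestamps
    result = pipe(sample, return_timestamps=True)
    print(result["chunks"])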
    
  3. Transcribe Local Audio Files

    result = pipe("audio.mp3")
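
    For recordings longer than 30 seconds, the pipeline can also run a chunked long-form algorithm. The chunk length and batch size below are illustrative starting points rather than tuned values:
    
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,  # split audio into 30-second chunks
        batch_size=16,      # number of chunks transcribed in parallel
        torch_dtype=torch_dtype,
        device=device,
    )
    result = pipe("audio.mp3")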
    
  4. Use Cloud GPUs
    For better performance, especially with a model of this size, consider running inference on cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
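
    On GPU, inference can also be sped up at the attention layer. The variant below is a minimal sketch, assuming a transformers version with scaled dot-product attention (SDPA) support:
    
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        attn_implementation="sdpa",  # PyTorch scaled dot-product attention
    )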

License

Whisper is licensed under the Apache 2.0 License, allowing for both commercial and non-commercial use.
