Whisper large-v3-turbo

openai

Introduction

Whisper is an advanced model for automatic speech recognition (ASR) and speech translation developed by OpenAI. Trained on over 5 million hours of labeled audio, it generalizes robustly across datasets and domains. Whisper large-v3-turbo is a fine-tuned, optimized version of Whisper large-v3 in which the number of decoder layers is reduced from 32 to 4, yielding substantially faster decoding at a minor cost in quality.

Architecture

Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model, available in English-only and multilingual versions. The model either predicts a transcription in the same language as the source audio or translates the speech into English. Whisper large-v3-turbo has 809 million parameters; its speed-up comes from cutting the decoder to 4 layers while keeping the full 32-layer encoder of large-v3.
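
The layer reduction is visible directly in the model configuration. A minimal sketch for inspecting it with transformers' AutoConfig (only the configuration is fetched, not the weights):

      from transformers import AutoConfig

      # Fetch only the configuration, not the model weights.
      config = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
      print(config.encoder_layers)  # 32 -- the full large-v3 encoder is kept
      print(config.decoder_layers)  # 4  -- reduced from 32 in large-v3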

Training

Whisper's training involved large-scale weak supervision using noisy data, which enhances its robustness to accents, background noise, and technical language. It is capable of zero-shot translation from multiple languages into English. Despite its strengths, there are challenges such as hallucinations, lower performance on low-resource languages, and variations in accuracy across different demographic groups.
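
Translation is selected at inference time rather than by loading a separate model. A minimal sketch, assuming a French recording (interview_fr.mp3 is a hypothetical file name):

      from transformers import pipeline

      asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
      # task="translate" asks Whisper for English output regardless of the source language.
      result = asr("interview_fr.mp3", generate_kwargs={"task": "translate"})
      print(result["text"])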

Guide: Running Locally

  1. Installation:

    • Upgrade pip and install the necessary libraries:
      pip install --upgrade pip
      pip install --upgrade transformers datasets[audio] accelerate
      
  2. Setup:

    • Load the model and processor:
      import torch
      from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

      device = "cuda:0" if torch.cuda.is_available() else "cpu"
      # float16 needs a GPU; fall back to float32 on CPU.
      torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
      model_id = "openai/whisper-large-v3-turbo"

      model = AutoModelForSpeechSeq2Seq.from_pretrained(
          model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
      ).to(device)
      processor = AutoProcessor.from_pretrained(model_id)
      
  3. Transcribe Audio:

    • Use the pipeline to transcribe audio (options for long recordings are shown in the sketch after this list):
      pipe = pipeline(
          "automatic-speech-recognition",
          model=model,
          tokenizer=processor.tokenizer,
          feature_extractor=processor.feature_extractor,
          torch_dtype=torch_dtype,
          device=device,
      )
      result = pipe("audio.mp3")
      print(result["text"])
      
  4. Cloud GPUs:

    • For enhanced performance, consider using cloud-based GPUs such as those available through AWS, Google Cloud, or Azure.
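
Beyond the basic call in step 3, the same pipeline exposes options for long recordings. A minimal sketch, assuming the pipe object from steps 2–3 and a hypothetical long_meeting.mp3 file:

      # Split long audio into 30-second windows and batch them through the model.
      result = pipe(
          "long_meeting.mp3",
          chunk_length_s=30,
          batch_size=8,
          return_timestamps=True,
      )
      print(result["text"])       # full transcript
      print(result["chunks"][0])  # first segment with its (start, end) timestamps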

License

The Whisper model is licensed under the MIT License, allowing for wide-ranging use and modification with appropriate attribution.
