Whisper Medium

openai

Introduction

Whisper is a pre-trained model developed by OpenAI for automatic speech recognition (ASR) and speech translation. It is trained on 680,000 hours of labeled data, showcasing strong generalization across various datasets and domains. The model is particularly effective for English speech recognition.

Architecture

Whisper employs a Transformer-based encoder-decoder architecture, functioning as a sequence-to-sequence model. It is released in both English-only and multilingual configurations and handles speech recognition as well as speech translation. The model comes in five sizes, with parameter counts ranging from 39M (tiny) to 1550M (large-v2). All configurations are available on the Hugging Face Hub.
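
On the Hub, the English-only variants are distinguished by an ".en" suffix on the checkpoint name. A minimal sketch of loading the two medium variants, shown only to illustrate the naming scheme:

    from transformers import WhisperForConditionalGeneration

    # Multilingual checkpoint: speech recognition in many languages plus translation to English
    multilingual = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
    # English-only checkpoint: trained for English speech recognition only
    english_only = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")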

Training

Whisper was trained with large-scale weak supervision on a diverse corpus. The training data comprises 65% English audio with English transcripts, 18% non-English audio paired with English transcripts, and 17% non-English audio with transcripts in the corresponding language, covering 98 languages in total. The model's performance in a given language correlates with the amount of training data available for that language.
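
Because a single multilingual checkpoint covers both transcription and translation to English, the decoding language and task can be pinned through the processor. A minimal sketch, assuming French audio purely for illustration:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

    # task="translate" maps non-English speech to English text;
    # task="transcribe" keeps the output in the source language.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    # Later, pass these ids to generation:
    #   model.generate(input_features, forced_decoder_ids=forced_decoder_ids)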

Guide: Running Locally

  1. Install Required Libraries: Ensure you have Python and PyTorch installed. Use pip install transformers datasets to install necessary packages.

  2. Load Model and Processor:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # Load the feature extractor/tokenizer pair and the pre-trained checkpoint from the Hub
    processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
    
  3. Prepare Audio Input: Resample audio to the 16 kHz format Whisper expects and load it using the datasets library, for example as sketched below.

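     A minimal sketch, pulling a short sample from a toy LibriSpeech split on the Hub (the dataset is assumed here purely for illustration; substitute your own audio):

    from datasets import load_dataset

    # Load a small ASR test set; each example exposes a decoded "audio" dict
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]  # provides "array" (waveform) and "sampling_rate" (16 kHz)
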
  4. Generate Transcription:

    # Convert the raw waveform to log-Mel spectrogram input features
    input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
    # Autoregressively generate token ids, then decode them to text without special tokens
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    
  5. Cloud GPUs: For efficient processing, consider a cloud GPU service such as AWS, Google Cloud, or Azure; the sketch below shows how to move inference onto a GPU.
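
     A minimal sketch of GPU inference once the model and inputs from the previous steps are in memory, assuming a CUDA device is available:

    import torch

    # Move the model and inputs to the GPU (fall back to CPU if none is available)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    predicted_ids = model.generate(input_features.to(device))
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)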

License

Whisper is released under the Apache-2.0 license, allowing for both personal and commercial use with attribution.
