whisper-tiny.en

openai

Introduction

Whisper is a pre-trained model developed by OpenAI for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes well across various datasets and domains. Whisper is available in English-only and multilingual versions, with several model sizes offered on the Hugging Face Hub.

Architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. Each checkpoint is trained on either English-only or multilingual data: the English-only checkpoints perform speech recognition, while the multilingual checkpoints additionally support speech translation, predicting transcriptions in a language different from the audio. Checkpoints are available in five sizes (tiny, base, small, medium, and large), with English-only and multilingual variants on the Hugging Face Hub; tiny.en is the smallest English-only checkpoint.
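
As a brief illustration of the recognition/translation distinction, here is a sketch (an assumption for illustration, not part of this card's example code) using the multilingual openai/whisper-tiny checkpoint; the English-only tiny.en checkpoint needs no language or task prompt:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # multilingual checkpoint; ".en" checkpoints skip the language/task prompt
    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

    # ask the decoder to translate French speech into English text;
    # pass these ids to model.generate(input_features, forced_decoder_ids=...)
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    print(forced_decoder_ids)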

Training

The models were trained on 680,000 hours of audio and corresponding transcripts sourced from the internet. The dataset includes 65% English audio, 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, covering 98 different languages. Transcription performance in a given language correlates with the amount of training data available for that language.

Guide: Running Locally

  1. Install Required Libraries: Ensure you have transformers, datasets, and torch installed.
  2. Load Model and Processor:
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # load the processor and model from the Hub; .to("cuda") assumes a GPU is available
    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en").to("cuda")
    
  3. Prepare Audio Input: Use the datasets library to load and preprocess your audio data (see the end-to-end sketch after this list).
  4. Perform Transcription:
    # the processor expects raw audio plus its sampling rate (16 kHz for Whisper)
    input_features = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features.to("cuda"))
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    
  5. Evaluate Performance: Use the wer metric to assess transcription accuracy (a WER sketch follows below).
  6. Long-Form Audio: Use chunking for longer audio, setting chunk_length_s=30 in the pipeline (see the pipeline sketch below).
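
The steps above can be combined into a minimal end-to-end sketch. It assumes the libraries from step 1 are installed (pip install transformers datasets torch) and uses the LibriSpeech dummy split from the Hugging Face Hub as a stand-in for your own audio; run on CPU as shown, or move the model and features to "cuda" as in step 2:

    from datasets import load_dataset
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    # load model and processor (CPU here for simplicity)
    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

    # a short 16 kHz validation sample used as example input
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    sample = ds[0]["audio"]

    # convert raw audio to log-Mel input features
    input_features = processor(
        sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
    ).input_features

    # generate token ids and decode them to text
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    print(transcription)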
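
For step 5, a sketch of computing word error rate with the evaluate library (an extra dependency, installable with pip install evaluate jiwer; the reference and prediction strings below are placeholders):

    from evaluate import load

    # word error rate compares predicted transcriptions against reference texts
    wer_metric = load("wer")

    references = ["the quick brown fox jumped over the lazy dog"]   # ground-truth transcript
    predictions = ["the quick brown fox jumped over the lazy cat"]  # model output from batch_decode
    print(wer_metric.compute(references=references, predictions=predictions))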
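
For step 6, a sketch of long-form transcription with the transformers pipeline; the 30-second chunk length matches the step above, and the audio path is a placeholder to replace with your own file:

    from transformers import pipeline

    # chunk_length_s=30 splits long audio into 30-second windows that are
    # transcribed independently and stitched back together
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny.en",
        chunk_length_s=30,
    )

    # replace with a path to your own long audio file (placeholder)
    result = asr("path/to/long_audio.wav")
    print(result["text"])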

Cloud GPUs: For optimal performance, consider using cloud services like AWS, Google Cloud, or Azure, which provide GPU resources to accelerate processing.

License

Whisper is released under the Apache 2.0 License, allowing for both personal and commercial use with attribution.
