Whisper Small

OpenAI

Introduction

Whisper is a pre-trained model developed by OpenAI for Automatic Speech Recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes to many datasets and domains without task-specific fine-tuning and supports 99 languages across multiple speech tasks.
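
As a quick way to try the model, the Hugging Face transformers pipeline API wraps preprocessing, generation, and decoding in one call. A minimal sketch; the file name is a placeholder, and decoding audio files requires ffmpeg to be installed:

    from transformers import pipeline
    
    # Build an ASR pipeline around the small checkpoint (weights download on first use).
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    
    # "sample.wav" is a placeholder path; the pipeline resamples file input
    # to the 16 kHz rate the model expects.
    print(asr("sample.wav")["text"])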

Architecture

Whisper is based on a Transformer encoder-decoder, i.e. a sequence-to-sequence architecture. Audio is converted to a log-Mel spectrogram and passed to the encoder; the decoder then autoregressively generates the transcription or translation, guided by special context tokens that specify the task (transcribe or translate) and the language.
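
As a sketch of how these context tokens are set in the transformers library (the language and task chosen here are illustrative), the processor can build the token prompt that steers generation:

    from transformers import WhisperProcessor
    
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    
    # Build context tokens that force the decoder to translate French speech into English text.
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
    print(forced_decoder_ids)
    
    # These are passed to generation as:
    # model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

If the context tokens are left unforced, the model predicts the language itself and defaults to transcription.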

Training

The model was trained on 680,000 hours of audio and corresponding transcripts collected from the internet, covering both English and multilingual speech. It demonstrates robust performance on ASR tasks across many languages, though accuracy varies with the amount of training data available per language and with speaker demographics.

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have the transformers and datasets libraries installed, along with a PyTorch backend.

    pip install transformers datasets torch
    
  2. Load Model and Processor: Use Hugging Face's transformers library to load the Whisper model and processor.

    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    
  3. Prepare Audio Input: Load your audio data, resample it to 16 kHz, and preprocess it using the processor.

    from datasets import Audio, load_dataset
    
    # "your_dataset" is a placeholder; use any dataset with an "audio" column.
    ds = load_dataset("your_dataset", split="train")
    # Whisper's feature extractor expects 16 kHz audio, so resample if needed.
    ds = ds.cast_column("audio", Audio(sampling_rate=16000))
    sample = ds[0]["audio"]
    input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
    
  4. Generate and Decode: Use the model to generate predictions and decode them to text. For recordings longer than 30 seconds, see the note after these steps.

    # Autoregressively generate token IDs from the log-Mel input features.
    predicted_ids = model.generate(input_features)
    # Decode the IDs to text, dropping Whisper's special context tokens.
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    

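Note that Whisper processes audio in 30-second windows, so the snippet above truncates longer inputs. For long-form audio, the pipeline API shown in the introduction can chunk the input and stitch the results; a minimal sketch with an illustrative file name:

    from transformers import pipeline
    
    # chunk_length_s splits long audio into 30-second windows and merges the outputs.
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-small",
        chunk_length_s=30,
    )
    
    # "long_audio.wav" is a placeholder for any long recording.
    print(asr("long_audio.wav")["text"])
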
Cloud GPUs

For performance optimization, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure, which provide scalable resources for running large models efficiently.
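
On a machine with a GPU, the guide above needs only a device change; a minimal sketch, reusing the model, processor, and input_features variables from the steps above and assuming a CUDA device:

    import torch
    
    # Use the GPU when one is available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    
    # Inputs must be on the same device as the model before generation.
    predicted_ids = model.generate(input_features.to(device))
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)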

License

The Whisper model is released under the Apache-2.0 license, which permits broad use, modification, and redistribution provided the license and copyright notices are preserved.
