whisper small.en
openai

Introduction
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, developed by OpenAI. Trained on 680,000 hours of labeled speech data, it generalizes well to many datasets and domains without requiring fine-tuning. Whisper was introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. and is available in multiple configurations for different use cases.
Architecture
Whisper uses a Transformer-based encoder-decoder architecture, often referred to as a sequence-to-sequence model. Checkpoints come in five sizes: tiny, base, small, medium, and large, all available on the Hugging Face Hub. The four smallest sizes are trained on either English-only or multilingual data, while the largest checkpoints are multilingual only; multilingual checkpoints handle both speech recognition and speech translation to English, whereas English-only checkpoints such as small.en perform transcription only.
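For reference, the Hub checkpoint IDs follow a simple naming scheme; the sketch below lists them under the openai namespace (the .en suffix marks the English-only variants):

```python
# Whisper checkpoint IDs on the Hugging Face Hub, by model size.
# The ".en" suffix marks English-only variants; "large" ships
# multilingual only.
WHISPER_CHECKPOINTS = {
    "tiny":   ("openai/whisper-tiny.en", "openai/whisper-tiny"),
    "base":   ("openai/whisper-base.en", "openai/whisper-base"),
    "small":  ("openai/whisper-small.en", "openai/whisper-small"),
    "medium": ("openai/whisper-medium.en", "openai/whisper-medium"),
    "large":  (None, "openai/whisper-large"),
}
```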
Training
Whisper models were trained on 680,000 hours of audio and the corresponding transcripts collected from the internet, roughly 65% of which is English-language data; in total, the data spans 98 languages. This large-scale, weakly supervised training makes Whisper robust to varied accents and background noise and enables zero-shot translation into English. However, because the transcripts are noisy, the model may hallucinate text, and it performs less reliably on low-resource languages.
Guide: Running Locally
- Setup Environment:
  - Install PyTorch and the Transformers library from Hugging Face (a typical install command is shown below).
  - Use a Python environment with the necessary dependencies.
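A typical install command for the setup step (assuming pip; the datasets and evaluate packages are only needed for the demo and evaluation snippets further below):

```bash
pip install --upgrade torch transformers datasets evaluate
```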
- Load Model and Processor:
  - Load the Whisper model and processor using the WhisperProcessor and WhisperForConditionalGeneration classes from the Transformers library (see the sketch below).
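A minimal loading sketch for the checkpoint this card describes:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# The processor bundles the feature extractor (audio -> spectrogram)
# and the tokenizer (token IDs -> text).
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")
```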
- Transcription:
  - Use the processor to convert audio inputs to log-Mel spectrogram features.
  - Generate token IDs with the model and decode them to obtain transcriptions (an end-to-end example follows).
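A self-contained transcription sketch, using the small hf-internal-testing/librispeech_asr_dummy dataset as a stand-in for your own 16 kHz audio:

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

# Load one audio sample; Whisper expects 16 kHz input.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Raw waveform -> log-Mel spectrogram features.
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Autoregressively generate token IDs, then decode them to text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```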
- Evaluation:
  - Evaluate model performance on datasets such as LibriSpeech by computing the Word Error Rate (WER); a sketch follows.
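A minimal WER sketch using the evaluate library; the reference and prediction lists here are placeholders for real LibriSpeech transcripts and model outputs:

```python
from evaluate import load

wer_metric = load("wer")

# In practice, collect references from the dataset and predictions by
# running model.generate(...) over the evaluation split.
references = ["mister quilter is the apostle of the middle classes"]
predictions = ["mister quilter is the apostle of the middle classes"]

print(100 * wer_metric.compute(references=references, predictions=predictions))
```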
- Recommended Resources:
  - Use cloud GPUs (e.g., AWS, Google Cloud, Azure) for efficient processing, especially with the larger checkpoints; a device-placement sketch follows.
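One common pattern for GPU use (a sketch, not an official recommendation) is to load the weights in half precision and move the model to the device:

```python
import torch
from transformers import WhisperForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Half precision roughly halves GPU memory use; inputs passed to
# model.generate(...) must be moved to the same device and dtype.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small.en", torch_dtype=dtype
).to(device)
```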
License
Whisper is licensed under the Apache 2.0 License, which permits free use subject to the terms and conditions of the license.