Whisper Small
openai

Introduction
Whisper is a pre-trained model developed by OpenAI for Automatic Speech Recognition (ASR) and speech translation, capable of generalizing across various datasets without fine-tuning. It is trained on 680k hours of labeled data, supporting 99 languages and multiple speech tasks.
Architecture
Whisper uses a Transformer encoder-decoder (sequence-to-sequence) architecture. Audio is converted to a log-Mel spectrogram and fed to the encoder; the decoder then generates a transcription or translation, guided by special context tokens that specify the language and task.
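As a minimal sketch of how these context tokens work in practice (assuming the Hugging Face transformers API; the French-to-English translation setting below is just an illustration), the processor can build the decoder prompt explicitly:

```python
from transformers import WhisperProcessor

# Load the processor, which bundles the feature extractor and tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Context tokens such as <|fr|> and <|translate|> tell the decoder which
# language the audio is in and whether to transcribe or translate.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# Inspect the special tokens that will be forced at the start of decoding.
for position, token_id in forced_decoder_ids:
    print(position, processor.tokenizer.decode([token_id]))
```

The resulting (position, token_id) pairs can be passed to model.generate(..., forced_decoder_ids=forced_decoder_ids) so that decoding starts with the desired language and task tokens.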
Training
The model was trained on 680,000 hours of labeled audio collected from the internet, spanning both English-only and multilingual speech. It performs robustly on ASR across many languages, though accuracy varies with how well a language is represented in the training data and across demographic groups.
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure you have the transformers and datasets libraries installed.

  ```bash
  pip install transformers datasets
  ```
- Load Model and Processor: Use Hugging Face's transformers library to load the Whisper model and processor.

  ```python
  from transformers import WhisperProcessor, WhisperForConditionalGeneration

  processor = WhisperProcessor.from_pretrained("openai/whisper-small")
  model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
  ```
- Prepare Audio Input: Load your audio data and preprocess it with the processor; Whisper expects audio sampled at 16 kHz.

  ```python
  from datasets import load_dataset

  # Replace "your_dataset" with a dataset that has an "audio" column.
  ds = load_dataset("your_dataset", split="train")
  input_features = processor(
      ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt"
  ).input_features
  ```
- Generate and Decode: Use the model to generate predicted token IDs and decode them to text.

  ```python
  predicted_ids = model.generate(input_features)
  transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
  ```
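Putting the steps together, the following is a self-contained sketch. The tiny LibriSpeech sample set used here (hf-internal-testing/librispeech_asr_dummy) is an assumption chosen for illustration; any dataset with a 16 kHz "audio" column works the same way.

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# A small English speech sample set, already at 16 kHz (assumed dataset).
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Convert raw audio into the log-Mel spectrogram features the encoder expects.
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token IDs and decode them, dropping the context/special tokens.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```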
Cloud GPUs
For performance optimization, consider using cloud GPUs from platforms like AWS, Google Cloud, or Azure, which provide scalable resources for running large models efficiently.
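As a hedged sketch of what GPU inference looks like in code (assuming PyTorch with a CUDA device, and continuing from the variables defined in the steps above):

```python
import torch

# Pick the GPU when one is visible; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Input features must live on the same device as the model.
predicted_ids = model.generate(input_features.to(device))
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```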
License
The Whisper model is released under the Apache-2.0 license, allowing for broad use and modification with attribution.