whisper tiny

openai

Introduction

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Developed by OpenAI, it was introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision". The model is trained on 680,000 hours of labeled audio, which lets it generalize across many datasets and domains without fine-tuning.

Architecture

Whisper is a Transformer-based encoder-decoder, also known as a sequence-to-sequence model. It is available in five sizes. The smaller sizes come in both English-only and multilingual versions, while the largest checkpoints are multilingual only. The same model handles both tasks: for speech recognition it predicts a transcript in the language of the audio, and for speech translation it predicts a transcript in a different language.
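
As an illustration of how one decoder can serve both tasks, the multilingual checkpoints are steered by a short prefix of special tokens that names the audio language and the task. The sketch below only builds that prefix as a string; the helper name decoder_prompt is hypothetical, and English-only checkpoints do not use the language and task tokens.

```python
def decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Build the special-token prefix that tells a multilingual Whisper model what to do.

    Hypothetical helper for illustration; in practice the tokenizer or
    generation settings produce these tokens for you.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    # Start token, then the language of the audio, then the task.
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


# French audio, translation requested.
print(decoder_prompt("fr", "translate"))
# <|startoftranscript|><|fr|><|translate|><|notimestamps|>
```

Swapping "translate" for "transcribe" is the only change needed to ask for a transcript in the language of the audio instead.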

Training

The model was trained on a diverse set of audio collected from the internet. About 65% of it is English-language audio paired with English transcripts. The remainder is non-English audio covering 98 languages: roughly 18% of the total is paired with English transcripts and 17% with transcripts in the original language. This large-scale weak supervision makes the model robust to accents, background noise, and technical language.
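
To put the percentages in terms of hours, the short calculation below applies the 65% English share to the 680,000-hour total from the introduction. The results are approximate, because the 65% figure is itself rounded.

```python
TOTAL_HOURS = 680_000   # total labeled audio, as stated in the introduction
ENGLISH_PERCENT = 65    # share of English audio with English transcripts

# Integer arithmetic keeps the result exact for these inputs.
english_hours = TOTAL_HOURS * ENGLISH_PERCENT // 100
non_english_hours = TOTAL_HOURS - english_hours

print(english_hours)      # 442000
print(non_english_hours)  # 238000
```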

Guide: Running Locally

  1. Installation: Install the transformers library (for example, with pip install transformers) along with a backend such as PyTorch.
  2. Load Model and Processor: Use the WhisperProcessor and WhisperForConditionalGeneration from Hugging Face Transformers.
  3. Prepare Dataset: Load audio datasets using libraries like datasets.
  4. Generate Transcriptions: Convert audio inputs to log-Mel spectrograms, generate token IDs, and decode them into text.
  5. Example Transcription: Adapt the code snippets on the model card for English or multilingual transcription and for translation.
  6. Evaluation: Measure performance using metrics such as Word Error Rate (WER).
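
The steps above can be sketched as follows. This is a minimal sketch, not the model card's own code: it assumes the checkpoint name openai/whisper-tiny, a PyTorch backend, and a mono waveform already resampled to 16 kHz. The function names transcribe and word_error_rate are illustrative, and the WER helper is a plain-Python stand-in for a dedicated metrics library.

```python
MODEL_ID = "openai/whisper-tiny"  # assumed Hugging Face Hub checkpoint name


def transcribe(audio_array, sampling_rate=16_000, model_id=MODEL_ID):
    """Return the transcription of one mono waveform (1-D float array)."""
    # Imported lazily so the WER helper below works without transformers.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # The processor converts the raw waveform into log-Mel input features.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    # generate() predicts token IDs; batch_decode() maps them back to text.
    predicted_ids = model.generate(inputs.input_features)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Two-row dynamic programme for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the cat sit"))
```

A waveform loaded with the datasets library can be passed to transcribe as audio_array, and its output compared against the reference text with word_error_rate. Transcripts are usually normalized (case, punctuation) before scoring, which this sketch leaves out.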

For efficient processing, consider using cloud GPUs such as those from AWS or Google Cloud.
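
Whether the code runs on a cloud GPU or a local machine, a common pattern is to pick the device at runtime. The sketch below assumes a PyTorch backend; pick_device is an illustrative name, and it falls back to the CPU when PyTorch or a GPU is unavailable.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing to place on a GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# A loaded model and its input features would then be moved with .to(device).
```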

License

The Whisper model is released under the Apache-2.0 license, allowing for broad usage and modification.
