Whisper large-v3-turbo

openai

Introduction

Whisper is an advanced model for automatic speech recognition (ASR) and speech translation developed by OpenAI. Trained on over 5 million hours of labeled audio, it generalizes robustly across datasets and domains. Whisper large-v3-turbo is a fine-tuned, optimized version of Whisper large-v3 in which the number of decoder layers is reduced from 32 to 4, yielding substantially faster decoding at a minor cost in quality.

Architecture

Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model, available in English-only and multilingual versions. The model either predicts a transcription in the same language as the source audio or translates the speech into English. Whisper large-v3-turbo has 809 million parameters; its speed-up comes from cutting the decoder to 4 layers while keeping the full 32-layer encoder of large-v3.
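
The layer reduction is visible directly in the model configuration. A minimal sketch for inspecting it with transformers' AutoConfig (only the configuration is fetched, not the weights):

      from transformers import AutoConfig

      # Fetch only the configuration, not the model weights.
      config = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
      print(config.encoder_layers)  # 32 -- the full large-v3 encoder is kept
      print(config.decoder_layers)  # 4  -- reduced from 32 in large-v3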

Training

Whisper's training involved large-scale weak supervision using noisy data, which enhances its robustness to accents, background noise, and technical language. It is capable of zero-shot translation from multiple languages into English. Despite its strengths, there are challenges such as hallucinations, lower performance on low-resource languages, and variations in accuracy across different demographic groups.
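
Translation is selected at inference time rather than by loading a separate model. A minimal sketch, assuming a French recording (interview_fr.mp3 is a hypothetical file name):

      from transformers import pipeline

      asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
      # task="translate" asks Whisper for English output regardless of the source language.
      result = asr("interview_fr.mp3", generate_kwargs={"task": "translate"})
      print(result["text"])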

Guide: Running Locally

  1. Installation:

    • Upgrade pip and install the necessary libraries:
      pip install --upgrade pip
      pip install --upgrade transformers datasets[audio] accelerate
      
  2. Setup:

    • Load the model and processor:
      import torch
      from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

      device = "cuda:0" if torch.cuda.is_available() else "cpu"
      # float16 needs a GPU; fall back to float32 on CPU.
      torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
      model_id = "openai/whisper-large-v3-turbo"

      model = AutoModelForSpeechSeq2Seq.from_pretrained(
          model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
      ).to(device)
      processor = AutoProcessor.from_pretrained(model_id)
      
  3. Transcribe Audio:

    • Use the pipeline to transcribe audio (options for long recordings are shown in the sketch after this list):
      pipe = pipeline(
          "automatic-speech-recognition",
          model=model,
          tokenizer=processor.tokenizer,
          feature_extractor=processor.feature_extractor,
          torch_dtype=torch_dtype,
          device=device,
      )
      result = pipe("audio.mp3")
      print(result["text"])
      
  4. Cloud GPUs:

    • For enhanced performance, consider using cloud-based GPUs such as those available through AWS, Google Cloud, or Azure.
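
Beyond the basic call in step 3, the same pipeline exposes options for long recordings. A minimal sketch, assuming the pipe object from steps 2–3 and a hypothetical long_meeting.mp3 file:

      # Split long audio into 30-second windows and batch them through the model.
      result = pipe(
          "long_meeting.mp3",
          chunk_length_s=30,
          batch_size=8,
          return_timestamps=True,
      )
      print(result["text"])       # full transcript
      print(result["chunks"][0])  # first segment with its (start, end) timestamps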

License

The Whisper model is licensed under the MIT License, allowing for wide-ranging use and modification with appropriate attribution.
