Whisper Small Cantonese

alvanlii

Introduction

The Whisper Small Cantonese model is a fine-tuned version of OpenAI's Whisper-Small model, optimized for Automatic Speech Recognition (ASR) in the Cantonese language. It achieves a Character Error Rate (CER) of 7.93 on Common Voice 16.0 without punctuation.

Architecture

  • Base Model: OpenAI Whisper-Small
  • Languages: Chinese (zh), Yue Chinese (yue)
  • Datasets: Mozilla Foundation Common Voice 16.0 and 17.0
  • Model Index: Evaluated on Common Voice 16.0 yue Test set
  • Metrics: Normalized CER of 7.93

Training

The model was trained on multiple datasets, including CantoMap and Cantonese-ASR, totalling several hundred hours of audio. Training used a learning rate of 5e-5, batch sizes sized to fit a single RTX 3090 GPU, and the Adam optimizer. Evaluation reported CER both with and without punctuation.
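The CER metric used above can be reproduced in a few lines of plain Python: it is the Levenshtein (edit) distance between the reference and hypothesis character sequences, divided by the reference length. The sketch below is illustrative; in particular, the punctuation set used for the "without punctuation" variant is an assumption, since the model card does not specify the exact normalization.

```python
import re

# Illustrative punctuation/whitespace set; the exact normalization used
# in the model card's evaluation is not specified.
_PUNCT = re.compile(r"[\u3002\uff0c\uff1f\uff01\uff1a\uff1b,.?!:;()\s]")

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

def cer_no_punct(reference: str, hypothesis: str) -> float:
    """CER after stripping punctuation and whitespace from both strings."""
    return cer(_PUNCT.sub("", reference), _PUNCT.sub("", hypothesis))
```

For example, `cer("abcd", "abed")` is 0.25 (one substitution over four reference characters).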

Guide: Running Locally

  1. Install Dependencies: Ensure the librosa and transformers libraries are installed.
  2. Load the Model:
    from transformers import WhisperForConditionalGeneration, WhisperProcessor
    processor = WhisperProcessor.from_pretrained("alvanlii/whisper-small-cantonese")
    model = WhisperForConditionalGeneration.from_pretrained("alvanlii/whisper-small-cantonese")
    
  3. Process Audio:
    import librosa
    y, sr = librosa.load('audio.mp3', sr=16000)
    processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
    
  4. Generate Transcriptions:
    gout = model.generate(input_features=processed_in.input_features, return_dict_in_generate=True)
    transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
    print(transcription)
    
  5. Hardware: For improved inference speed, a GPU such as an NVIDIA RTX 3090 is recommended.
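One caveat for the steps above: Whisper models process audio in 30-second windows, so longer recordings should be split into chunks before step 3 and transcribed chunk by chunk. A minimal chunking sketch using NumPy (the chunk length matches Whisper's window; the function name is illustrative):

```python
import numpy as np

def chunk_audio(y: np.ndarray, sr: int = 16000, chunk_s: float = 30.0):
    """Split a 1-D waveform into fixed-length chunks of chunk_s seconds.

    The final chunk may be shorter; the Whisper processor pads it to the
    full 30-second window internally.
    """
    step = int(chunk_s * sr)
    return [y[i:i + step] for i in range(0, len(y), step)]
```

Each chunk can then be passed through the processor and model exactly as in steps 3 and 4, and the resulting transcriptions concatenated.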

License

The model is licensed under Apache 2.0.
