Whisper Small Cantonese
Introduction
The Whisper Small Cantonese model is a fine-tuned version of OpenAI's Whisper-Small model, optimized for Automatic Speech Recognition (ASR) in the Cantonese language. It achieves a Character Error Rate (CER) of 7.93 on Common Voice 16.0 without punctuation.
Architecture
- Base Model: OpenAI Whisper-Small
- Languages: Chinese (zh), Yue Chinese (yue)
- Datasets: Mozilla Foundation Common Voice 16.0 and 17.0
- Model Index: Evaluated on Common Voice 16.0 yue Test set
- Metrics: Normalized CER of 7.93
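The reported metric, Character Error Rate, is the character-level edit distance between hypothesis and reference, divided by the reference length. The evaluation presumably used a standard library (e.g. `evaluate`/`jiwer`); the following pure-Python sketch is for illustration only:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit distance, one row at a time
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(r)
```

For example, `cer("你好嗎", "你好吗")` yields one substitution over three reference characters, i.e. about 0.33; a reported CER of 7.93 corresponds to roughly 8 character errors per 100 reference characters.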
Training
The model was trained on multiple datasets, including CantoMap and Cantonese-ASR, totalling several hundred hours of audio. Training used a learning rate of 5e-5, batch sizes sized to fit a single RTX 3090 GPU, and the Adam optimizer. The model was evaluated using CER both with and without punctuation.
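Since the headline CER is reported "without punctuation", evaluation requires normalizing both reference and hypothesis first. The exact normalization used for the reported 7.93 is not specified here; one plausible sketch, using Unicode categories so that CJK punctuation (。，？) is stripped along with ASCII marks, is:

```python
import unicodedata

def strip_punctuation(text: str) -> str:
    """Remove all Unicode punctuation (category 'P*'), ASCII and CJK alike."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )
```

Applying this to both strings before computing CER gives the "without punctuation" variant of the metric.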
Guide: Running Locally
- Install Dependencies: Ensure the librosa and transformers libraries are installed.
- Load the Model:
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("alvanlii/whisper-small-cantonese")
model = WhisperForConditionalGeneration.from_pretrained("alvanlii/whisper-small-cantonese")
```
- Process Audio:
```python
import librosa

# Whisper expects 16 kHz audio; librosa resamples on load
y, sr = librosa.load('audio.mp3', sr=16000)
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
```
- Generate Transcriptions:
```python
# generate() returns a tensor of token IDs by default
gout = model.generate(input_features=processed_in.input_features)
transcription = processor.batch_decode(gout, skip_special_tokens=True)[0]
print(transcription)
```
- Hardware: For faster inference, a GPU such as an NVIDIA RTX 3090 is recommended.
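To actually benefit from a GPU, the model and inputs must be moved onto it. A minimal sketch, assuming PyTorch is installed and that `model` and `processed_in` come from the steps above (shown commented, since they require the loaded model):

```python
import torch

# Pick the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumes `model` and `processed_in` from the steps above:
# model.to(device)
# gout = model.generate(input_features=processed_in.input_features.to(device))
```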
License
The model is licensed under Apache 2.0.