Whisper Small Cantonese

alvanlii

Introduction

The Whisper Small Cantonese model is a fine-tuned version of OpenAI's Whisper-Small model, optimized for Automatic Speech Recognition (ASR) in the Cantonese language. It achieves a Character Error Rate (CER) of 7.93 on Common Voice 16.0 without punctuation.

Architecture

  • Base Model: OpenAI Whisper-Small
  • Languages: Chinese (zh), Yue Chinese (yue)
  • Datasets: Mozilla Foundation Common Voice 16.0 and 17.0
  • Model Index: Evaluated on Common Voice 16.0 yue Test set
  • Metrics: Normalized CER of 7.93

Training

The model was trained on multiple datasets, including CantoMap and Cantonese-ASR, totalling several hundred hours of audio. Training used a learning rate of 5e-5, batch sizes sized to fit a single RTX 3090 GPU, and the Adam optimizer. Evaluation reported CER both with and without punctuation.
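The CER metric used above can be reproduced in a few lines of plain Python: it is the Levenshtein (edit) distance between the reference and hypothesis character sequences, divided by the reference length. The sketch below is illustrative; in particular, the punctuation set used for the "without punctuation" variant is an assumption, since the model card does not specify the exact normalization.

```python
import re

# Illustrative punctuation/whitespace set; the exact normalization used
# in the model card's evaluation is not specified.
_PUNCT = re.compile(r"[\u3002\uff0c\uff1f\uff01\uff1a\uff1b,.?!:;()\s]")

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

def cer_no_punct(reference: str, hypothesis: str) -> float:
    """CER after stripping punctuation and whitespace from both strings."""
    return cer(_PUNCT.sub("", reference), _PUNCT.sub("", hypothesis))
```

For example, `cer("abcd", "abed")` is 0.25 (one substitution over four reference characters).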

Guide: Running Locally

  1. Install Dependencies: Ensure the librosa and transformers libraries are installed.
  2. Load the Model:
    from transformers import WhisperForConditionalGeneration, WhisperProcessor
    processor = WhisperProcessor.from_pretrained("alvanlii/whisper-small-cantonese")
    model = WhisperForConditionalGeneration.from_pretrained("alvanlii/whisper-small-cantonese")
    
  3. Process Audio:
    import librosa
    y, sr = librosa.load('audio.mp3', sr=16000)
    processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
    
  4. Generate Transcriptions:
    gout = model.generate(input_features=processed_in.input_features, return_dict_in_generate=True)
    transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
    print(transcription)
    
  5. Hardware: For improved inference speed, a GPU such as an NVIDIA RTX 3090 is recommended.
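One caveat for the steps above: Whisper models process audio in 30-second windows, so longer recordings should be split into chunks before step 3 and transcribed chunk by chunk. A minimal chunking sketch using NumPy (the chunk length matches Whisper's window; the function name is illustrative):

```python
import numpy as np

def chunk_audio(y: np.ndarray, sr: int = 16000, chunk_s: float = 30.0):
    """Split a 1-D waveform into fixed-length chunks of chunk_s seconds.

    The final chunk may be shorter; the Whisper processor pads it to the
    full 30-second window internally.
    """
    step = int(chunk_s * sr)
    return [y[i:i + step] for i in range(0, len(y), step)]
```

Each chunk can then be passed through the processor and model exactly as in steps 3 and 4, and the resulting transcriptions concatenated.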

License

The model is licensed under Apache 2.0.
