ViWhisper-medium

NhutP

Introduction

We introduce ViWhisper-medium, a Vietnamese automatic speech recognition model fine-tuned from OpenAI's Whisper Medium checkpoint, primarily on the VSV-1100 dataset.

Architecture

The model keeps the openai/whisper-medium encoder-decoder architecture unchanged; fine-tuning adapts it to Vietnamese.
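The underlying dimensions can be confirmed from the published configuration. This is a quick, optional check, assuming the checkpoint is available on the Hugging Face Hub:

    from transformers import WhisperConfig

    # Fetch only the model configuration, not the weights.
    config = WhisperConfig.from_pretrained("NhutP/ViWhisper-medium")
    # whisper-medium: d_model=1024, 24 encoder layers, 24 decoder layers.
    print(config.d_model, config.encoder_layers, config.decoder_layers)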

Training

The model was trained using a composite dataset:

  • VSV-1100: 1100 hours
  • CMV14-vi: 11 hours
  • VIVOS: 3.04 hours
  • VLSP2021: 180 hours

Total training data amounts to 1308 hours. A text-to-speech model was also employed to augment the dataset with sentences containing infrequent words; a sketch of that selection step follows.
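The card does not include the augmentation code, so the following is only a minimal sketch of how sentences containing infrequent words might be selected for synthesis. The corpus file, the frequency threshold, and the synthesize_speech function are all hypothetical:

    from collections import Counter

    # One sentence per line; the file name is a placeholder.
    with open("corpus.txt", encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]

    # Count word frequencies across the whole corpus.
    counts = Counter(word for s in sentences for word in s.lower().split())

    # Words seen fewer than 5 times count as infrequent (threshold is an assumption).
    rare = {word for word, count in counts.items() if count < 5}

    # Keep sentences that contain at least one infrequent word.
    targets = [s for s in sentences if rare & set(s.lower().split())]

    for sentence in targets:
        audio = synthesize_speech(sentence)  # hypothetical TTS call
        # ...pair `audio` with `sentence` and add the pair to the training set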

WER Results

The Word Error Rate (WER, in %) was evaluated on several test sets:

  • CMV14-vi: 8.1
  • VIVOS: 4.69
  • VLSP2020-T1: 13.22
  • VLSP2020-T2: 28.76
  • VLSP2021-T1: 11.78
  • VLSP2021-T2: 8.28
  • Bud500: 5.38
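The card does not state which tool produced these numbers. A common way to compute WER in Python is the jiwer package (pip install jiwer); the reference and hypothesis strings below are made-up examples:

    import jiwer

    reference = "xin chào các bạn"
    hypothesis = "xin chào cá bạn"

    # jiwer.wer returns a fraction; multiply by 100 to match the table above.
    print(jiwer.wer(reference, hypothesis) * 100)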

Guide: Running Locally

Basic Steps

  1. Install Dependencies (PyTorch is required by the transformers Whisper classes):
    pip install transformers librosa torch
    
  2. Load Model and Processor:
    from transformers import WhisperProcessor, WhisperForConditionalGeneration
    # The processor bundles Whisper's feature extractor and tokenizer.
    processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-medium")
    model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-medium")
    
  3. Inference:
    import librosa
    # Whisper expects 16 kHz mono audio; librosa resamples on load.
    array, sampling_rate = librosa.load('path_to_audio', sr=16000)
    input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features)
    # batch_decode returns one string per input; take the first.
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    
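For quick experiments, the same checkpoint can also be run through the transformers pipeline API; this alternative is not shown on the original card, and path_to_audio is again a placeholder:

    from transformers import pipeline

    # chunk_length_s lets the pipeline transcribe audio longer than Whisper's 30 s window.
    pipe = pipeline("automatic-speech-recognition", model="NhutP/ViWhisper-medium", chunk_length_s=30)
    print(pipe("path_to_audio")["text"])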

Suggested Cloud GPUs

Inference runs noticeably faster on a GPU; consider a cloud GPU service such as AWS, Google Cloud, or Azure, as in the snippet below.
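On a machine with a CUDA GPU, moving the model and inputs to the device speeds up generation. This is the standard PyTorch pattern, reusing model and input_features from the steps above:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    input_features = input_features.to(device)
    predicted_ids = model.generate(input_features)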

License

The model is released under the MIT License.
