Wav2Vec2 Base Vietnamese 250h

nguyenvulebinh

Introduction

The WAV2VEC2-BASE-VIETNAMESE-250H model is an automatic speech recognition (ASR) system for Vietnamese. It uses the Wav2Vec 2.0 architecture, which learns powerful speech representations from unlabeled audio through self-supervised pre-training, and is then fine-tuned on transcribed speech data.

Architecture

The model is pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the VLSP ASR dataset. Wav2Vec 2.0 extracts contextualized representations directly from raw audio waveforms; fine-tuning adds a character-level output head trained with Connectionist Temporal Classification (CTC), an alignment-free training objective widely used in ASR. A sketch of the CTC loss computation follows.
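As a rough illustration of how CTC fine-tuning works in transformers, the sketch below computes the CTC loss for a single (audio, transcript) pair by passing labels to Wav2Vec2ForCTC. The audio array and the transcript "xin chào" are placeholders, not real training data.

  import torch
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  
  processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
  model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
  
  # Placeholder inputs: one second of random 16 kHz "audio" and a dummy transcript.
  speech = torch.randn(16000).numpy()
  input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
  labels = processor.tokenizer("xin chào", return_tensors="pt").input_ids
  
  # Passing labels makes the model return the CTC loss alongside the logits.
  outputs = model(input_values, labels=labels)
  outputs.loss.backward()  # an optimizer step would follow in real fine-tuning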

Training

  • Pre-training Data: 13,000 hours of unlabeled YouTube audio.
  • Fine-tuning Data: 250 hours of labeled VLSP ASR dataset audio.
  • Model Parameters: Approximately 95 million.
  • Language Model: A 4-gram language model trained on 2 GB of spoken text is used at decoding time for improved accuracy (see the decoding sketch after this list).
  • WER Results (%), lower is better:

      Decoding         VIVOS   Common Voice VI   VLSP-T1   VLSP-T2
      Without LM       10.77   18.34             13.33     51.45
      With 4-gram LM    6.15   11.52              9.11     40.81
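
The 4-gram model itself is not bundled with this code, so the following is only a sketch of beam-search decoding with a KenLM binary via pyctcdecode; the path "vi_lm_4gram.bin" is a hypothetical placeholder, and in practice the special tokens (<pad> as the CTC blank, "|" as the word delimiter) may need remapping, as Wav2Vec2ProcessorWithLM does internally.

  import soundfile as sf
  import torch
  from pyctcdecode import build_ctcdecoder
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  
  processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
  model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
  
  # Order the vocabulary by token id so it lines up with the logit columns.
  vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])]
  decoder = build_ctcdecoder(vocab, kenlm_model_path="vi_lm_4gram.bin")  # hypothetical path
  
  speech, _ = sf.read("audio-test/t1_0001-00010.wav")
  input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
  with torch.no_grad():
      log_probs = torch.log_softmax(model(input_values).logits, dim=-1)[0].numpy()
  print(decoder.decode(log_probs))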

Guide: Running Locally

  1. Requirements:

    • Ensure audio input is sampled at 16 kHz (a resampling sketch follows below).
    • Audio should be shorter than 10 seconds.
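    • If the source audio is at a different rate, resample it first. A minimal sketch using librosa (an assumption; torchaudio or sox work just as well, and "input.wav" is a placeholder):
      import librosa
      
      speech, sr = librosa.load("input.wav", sr=None)  # keep the native sampling rate
      if sr != 16000:
          speech = librosa.resample(speech, orig_sr=sr, target_sr=16000)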
  2. Setup:

    • Install the necessary libraries: transformers, datasets, soundfile, and torch.
    • Load the model and processor using the transformers library:
      from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
      processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
      model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
      
    • Prepare and tokenize the audio:
      import soundfile as sf
      import torch
      
      # Read a waveform from disk; sf.read returns (samples, sampling_rate).
      def map_to_array(batch):
          speech, _ = sf.read(batch["file"])
          batch["speech"] = speech
          return batch
      
      # Load a test file and convert it into model-ready input values.
      ds = map_to_array({"file": 'audio-test/t1_0001-00010.wav'})
      input_values = processor(ds["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
      
    • Obtain and decode the output:
      # Run inference without tracking gradients, then greedy-decode the logits.
      with torch.no_grad():
          logits = model(input_values).logits
      predicted_ids = torch.argmax(logits, dim=-1)
      transcription = processor.batch_decode(predicted_ids)
      
  3. Cloud GPU Recommendation:

    • For faster inference, it is recommended to use a cloud GPU such as those offered by AWS, Google Cloud, or Azure; a sketch for moving inference onto the GPU follows.
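    • A minimal sketch for moving inference onto a GPU, reusing the model and input_values from the steps above (assumes a CUDA-capable machine):
      device = "cuda" if torch.cuda.is_available() else "cpu"
      model = model.to(device)
      with torch.no_grad():
          logits = model(input_values.to(device)).logits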

License

The model is available under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which permits use only for non-commercial purposes. Full license terms are available at https://creativecommons.org/licenses/by-nc/4.0/.
