wav2vec2-large-vi

nguyenvulebinh

Introduction

The WAV2VEC2-LARGE-VI model is a self-supervised speech representation model for Vietnamese built on the wav2vec 2.0 architecture. It is pre-trained on 13,000 hours of Vietnamese audio from YouTube, covering a wide range of recording conditions and dialects.

Architecture

The model employs the wav2vec 2.0 architecture for self-supervised learning: a convolutional feature encoder converts raw 16 kHz waveforms into latent speech representations, and a Transformer context network builds contextualized representations on top of them, trained with a contrastive objective over quantized targets. The architecture is the same as that of the original English wav2vec 2.0 models.
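
Since the checkpoint follows the standard Hugging Face wav2vec 2.0 layout, the key architectural dimensions can be read directly from its configuration without downloading the weights. A minimal sketch (the commented values are what a LARGE wav2vec 2.0 configuration typically reports, not figures confirmed by this card):

    from transformers import Wav2Vec2Config
    
    # fetches only the small config file, not the model weights
    config = Wav2Vec2Config.from_pretrained('nguyenvulebinh/wav2vec2-large-vi')
    print(config.hidden_size)        # Transformer width, typically 1024 for LARGE
    print(config.num_hidden_layers)  # Transformer depth, typically 24 for LARGE
    print(config.conv_dim)           # channel sizes of the convolutional feature encoder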

Training

Pre-training of the WAV2VEC2-LARGE-VI model ran for 20 epochs on a TPU v3-8 and took approximately 30 days. The model is available in two versions:

  • Base Version: Contains approximately 95 million parameters.
  • Large Version: Contains approximately 317 million parameters.
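
These figures can be verified locally once a checkpoint is loaded; a minimal sketch (note the total also includes the quantizer and projection heads used only during pre-training, so it may differ slightly from the quoted counts):

    from transformers import Wav2Vec2ForPreTraining
    
    model = Wav2Vec2ForPreTraining.from_pretrained('nguyenvulebinh/wav2vec2-large-vi')
    # total parameter count across all submodules
    print(sum(p.numel() for p in model.parameters()))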

Guide: Running Locally

Basic Steps

  1. Install Required Libraries:

    pip install transformers==4.20.0
    pip install https://github.com/kpu/kenlm/archive/master.zip
    pip install pyctcdecode==0.4.0
    
  2. Load the Model and Processor:

    from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Processor
    
    # Wav2Vec2ForPreTraining loads the self-supervised checkpoint,
    # which has no CTC head and therefore cannot transcribe on its own
    model_name = 'nguyenvulebinh/wav2vec2-large-vi'
    model = Wav2Vec2ForPreTraining.from_pretrained(model_name)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    
  3. Inference Example (the pre-trained checkpoint has no CTC head, so inference yields speech representations rather than transcriptions; for transcription see the sketch after these steps):

    import torchaudio
    
    # load the waveform; the model expects 16 kHz mono audio
    audio, sample_rate = torchaudio.load('your_audio_file.wav')
    audio = torchaudio.functional.resample(audio[0], sample_rate, 16000)
    input_data = processor(audio.numpy(), sampling_rate=16000, return_tensors='pt')
    output = model(**input_data)
    # contextual speech representations, one vector per audio frame
    print(output.projected_states.shape)
    
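For actual speech recognition, a checkpoint fine-tuned with a CTC head on labeled Vietnamese speech is required. A minimal sketch, assuming such a fine-tuned checkpoint is available (the model name below is illustrative, not confirmed by this card):

    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    # hypothetical fine-tuned checkpoint; substitute the CTC model you fine-tune yourself
    asr_name = 'nguyenvulebinh/wav2vec2-large-vi-vlsp2020'
    asr_model = Wav2Vec2ForCTC.from_pretrained(asr_name)
    asr_processor = Wav2Vec2Processor.from_pretrained(asr_name)
    
    audio, sample_rate = torchaudio.load('your_audio_file.wav')
    audio = torchaudio.functional.resample(audio[0], sample_rate, 16000)
    inputs = asr_processor(audio.numpy(), sampling_rate=16000, return_tensors='pt')
    with torch.no_grad():
        logits = asr_model(**inputs).logits
    # greedy CTC decoding of the most likely token at each frame
    print(asr_processor.batch_decode(torch.argmax(logits, dim=-1))[0])

The kenlm and pyctcdecode packages from step 1 come into play when a fine-tuned checkpoint ships an n-gram language model: Wav2Vec2ProcessorWithLM can then replace the greedy argmax above with LM-boosted beam-search decoding.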

Cloud GPUs

For more efficient training and inference, consider using cloud-based GPUs from providers like AWS, GCP, or Azure.

License

The WAV2VEC2-LARGE-VI model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), which allows for non-commercial use with attribution.
