wav2vec2-base-vi
Introduction
The WAV2VEC2-BASE-VI model, released by nguyenvulebinh, is a Vietnamese self-supervised learning model based on the wav2vec2 architecture. It is designed for speech processing tasks and pre-trained on a large corpus of Vietnamese audio collected from YouTube.
Architecture
The model uses the wav2vec2 architecture, in which a convolutional feature encoder maps raw audio to latent representations and a Transformer context network is pre-trained with a contrastive objective over masked time steps. This makes it well suited to learning speech representations without transcriptions; the setup follows the original English wav2vec 2.0 recipe, adapted here to Vietnamese data.
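As a quick way to inspect the checkpoint's architecture, the configuration can be loaded through the standard transformers API (a minimal sketch; the printed fields are standard `Wav2Vec2Config` attributes):

```python
from transformers import Wav2Vec2Config

# Load the checkpoint's configuration to inspect architecture hyperparameters.
config = Wav2Vec2Config.from_pretrained('nguyenvulebinh/wav2vec2-base-vi')
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```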
Training
The model is pre-trained on 13,000 hours of Vietnamese YouTube audio covering clean and noisy recordings, conversational speech, and a range of dialects and speaker genders. The base version contains approximately 95 million parameters, while the large version has around 317 million. Pre-training ran for 35 epochs (base) and 20 epochs (large) on a TPU v3-8, taking roughly 30 days.
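The quoted parameter counts can be sanity-checked by summing over the loaded model's tensors (a sketch; the exact total varies slightly depending on which pre-training heads are counted):

```python
from transformers import Wav2Vec2ForPreTraining

# Count parameters of the base checkpoint (~95M according to the model card).
model = Wav2Vec2ForPreTraining.from_pretrained('nguyenvulebinh/wav2vec2-base-vi')
print(f'{sum(p.numel() for p in model.parameters()):,}')
```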
Guide: Running Locally
- Installation:
  - Ensure you have Python and `pip` installed.
  - Install the required packages:

    ```bash
    pip install transformers==4.20.0
    pip install https://github.com/kpu/kenlm/archive/master.zip
    pip install pyctcdecode==0.4.0
    ```
- Load Model and Processor:
  - Use the following Python code to load the model:

    ```python
    from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Processor

    # Pre-trained checkpoint on the Hugging Face Hub
    model_name = 'nguyenvulebinh/wav2vec2-base-vi'
    model = Wav2Vec2ForPreTraining.from_pretrained(model_name)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    ```
- Inference:
  - Load and process audio data using `torchaudio` and the model’s processor.
  - Decode the output to obtain the transcription (see the sketch after this list).
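A minimal end-to-end sketch is given below. Note that `Wav2Vec2ForPreTraining` has no CTC head, so producing text assumes a checkpoint fine-tuned for CTC and loaded via `Wav2Vec2ForCTC`; the checkpoint name and audio path here are placeholders. The `pyctcdecode` and `kenlm` packages installed above enable beam-search decoding with a language model, but plain greedy decoding is shown for brevity.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder: substitute a CTC fine-tuned checkpoint here.
checkpoint = 'nguyenvulebinh/wav2vec2-base-vi'
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Load the audio and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load('example.wav')
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Extract input features and run a forward pass.
inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors='pt')
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each frame;
# batch_decode collapses repeats and blanks into text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```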
Cloud GPUs
For optimal performance, especially with the large model, consider cloud computing resources such as Google Cloud or AWS, which offer GPU instances for accelerated inference and fine-tuning.
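As a small illustration (framework-level only, not tied to any provider), moving the model and inputs onto a GPU in PyTorch looks like this:

```python
import torch

# Use a GPU when one is available; fall back to CPU otherwise.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```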
License
The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0), allowing for adaptation and sharing for non-commercial purposes.