wav2vec2-base-vi
Introduction
The WAV2VEC2-BASE-VI model, released by nguyenvulebinh, is a Vietnamese self-supervised learning model based on the wav2vec2 architecture. It is designed for speech processing tasks and pre-trained on a large corpus of Vietnamese audio collected from YouTube.
Architecture
The model uses the wav2vec2 architecture, in which a convolutional feature encoder maps raw audio to latent representations and a Transformer context network is pre-trained with a contrastive objective over masked time steps. This makes it well suited to learning speech representations without transcriptions; the setup follows the original English wav2vec 2.0 recipe, adapted here to Vietnamese data.
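As a quick way to inspect the checkpoint's architecture, the configuration can be loaded through the standard transformers API (a minimal sketch; the printed fields are standard `Wav2Vec2Config` attributes):

```python
from transformers import Wav2Vec2Config

# Load the checkpoint's configuration to inspect architecture hyperparameters.
config = Wav2Vec2Config.from_pretrained('nguyenvulebinh/wav2vec2-base-vi')
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```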
Training
The model is pre-trained on 13,000 hours of Vietnamese YouTube audio covering clean and noisy recordings, conversational speech, and a range of dialects and speaker genders. The base version contains approximately 95 million parameters, while the large version has around 317 million. Pre-training ran for 35 epochs (base) and 20 epochs (large) on a TPU v3-8, taking roughly 30 days.
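The quoted parameter counts can be sanity-checked by summing over the loaded model's tensors (a sketch; the exact total varies slightly depending on which pre-training heads are counted):

```python
from transformers import Wav2Vec2ForPreTraining

# Count parameters of the base checkpoint (~95M according to the model card).
model = Wav2Vec2ForPreTraining.from_pretrained('nguyenvulebinh/wav2vec2-base-vi')
print(f'{sum(p.numel() for p in model.parameters()):,}')
```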
Guide: Running Locally
- Installation:
  - Ensure you have Python and `pip` installed.
  - Install the required packages:

    ```bash
    pip install transformers==4.20.0
    pip install https://github.com/kpu/kenlm/archive/master.zip
    pip install pyctcdecode==0.4.0
    ```
- Load Model and Processor:
  - Use the following Python code to load the model:

    ```python
    from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Processor

    # Pre-trained checkpoint on the Hugging Face Hub
    model_name = 'nguyenvulebinh/wav2vec2-base-vi'
    model = Wav2Vec2ForPreTraining.from_pretrained(model_name)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    ```
- Inference:
  - Load and process audio data using `torchaudio` and the model’s processor.
  - Decode the output to obtain the transcription (see the sketch after this list).
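A minimal end-to-end sketch is given below. Note that `Wav2Vec2ForPreTraining` has no CTC head, so producing text assumes a checkpoint fine-tuned for CTC and loaded via `Wav2Vec2ForCTC`; the checkpoint name and audio path here are placeholders. The `pyctcdecode` and `kenlm` packages installed above enable beam-search decoding with a language model, but plain greedy decoding is shown for brevity.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder: substitute a CTC fine-tuned checkpoint here.
checkpoint = 'nguyenvulebinh/wav2vec2-base-vi'
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Load the audio and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load('example.wav')
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Extract input features and run a forward pass.
inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors='pt')
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token at each frame;
# batch_decode collapses repeats and blanks into text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```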
Cloud GPUs
For optimal performance, especially with the large model, consider cloud computing resources such as Google Cloud or AWS, which offer GPU instances for accelerated inference and fine-tuning.
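As a small illustration (framework-level only, not tied to any provider), moving the model and inputs onto a GPU in PyTorch looks like this:

```python
import torch

# Use a GPU when one is available; fall back to CPU otherwise.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```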
License
The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0), allowing for adaptation and sharing for non-commercial purposes.