wav2vec2-large-vi
nguyenvulebinh

Introduction
The WAV2VEC2-LARGE-VI model is a self-supervised speech model for Vietnamese based on the wav2vec2 architecture. It is pre-trained on 13,000 hours of Vietnamese audio from YouTube, covering a wide range of recording conditions and dialects.
Architecture
The model employs the wav2vec2 architecture for self-supervised learning: a convolutional feature encoder followed by a Transformer context network that converts raw audio into frame-level contextual representations. The architecture is the same as that of the original English wav2vec2 models.
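To make the data flow concrete, here is a minimal sketch (not from the original card) that runs one second of audio through the encoder; it assumes the checkpoint can be loaded as a plain Wav2Vec2Model for feature extraction:

  import torch
  from transformers import Wav2Vec2Model

  model = Wav2Vec2Model.from_pretrained('nguyenvulebinh/wav2vec2-large-vi')
  model.eval()

  # One second of 16 kHz audio: the convolutional encoder downsamples by ~320x,
  # so the Transformer sees about 49 frames of 1024-dim features (large model)
  waveform = torch.randn(1, 16000)
  with torch.no_grad():
      features = model(waveform).last_hidden_state
  print(features.shape)  # e.g. torch.Size([1, 49, 1024])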
Training
The pre-training of the WAV2VEC2-LARGE-VI model was run for 20 epochs on a TPU v3-8, taking approximately 30 days. The model is available in two versions (the sketch after this list shows one way to verify the sizes):
- Base Version: Contains approximately 95 million parameters.
- Large Version: Contains approximately 317 million parameters.
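The parameter counts above can be checked directly; a minimal sketch, assuming the base checkpoint is published as nguyenvulebinh/wav2vec2-base-vi (only the large name appears in this card):

  from transformers import Wav2Vec2ForPreTraining

  # 'nguyenvulebinh/wav2vec2-base-vi' is an assumed name; the large name is from this card
  for name in ['nguyenvulebinh/wav2vec2-base-vi', 'nguyenvulebinh/wav2vec2-large-vi']:
      model = Wav2Vec2ForPreTraining.from_pretrained(name)
      n_params = sum(p.numel() for p in model.parameters())
      print(f'{name}: {n_params / 1e6:.0f}M parameters')  # expect ~95M and ~317M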
Guide: Running Locally
Basic Steps
- Install Required Libraries (kenlm and pyctcdecode are used in the language-model decoding sketch after these steps):

  pip install transformers==4.20.0
  pip install https://github.com/kpu/kenlm/archive/master.zip
  pip install pyctcdecode==0.4.0
- Load the Model and Processor:

  from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Processor

  model_name = 'nguyenvulebinh/wav2vec2-large-vi'
  model = Wav2Vec2ForPreTraining.from_pretrained(model_name)
  processor = Wav2Vec2Processor.from_pretrained(model_name)
- Inference Example:

  import torchaudio

  # Load audio; the model expects mono 16 kHz input
  audio, sample_rate = torchaudio.load('your_audio_file.wav')
  if sample_rate != 16000:
      audio = torchaudio.functional.resample(audio, sample_rate, 16000)

  input_data = processor(audio[0].numpy(), sampling_rate=16000, return_tensors='pt')
  output = model(**input_data)

  # Wav2Vec2ForPreTraining has no CTC head, so its output contains frame-level
  # representations (output.projected_states) rather than character logits;
  # decoding text requires a checkpoint fine-tuned for ASR (e.g. Wav2Vec2ForCTC)
  print(output.projected_states.shape)
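The install step pulls in kenlm and pyctcdecode, which are typically used for language-model-rescored beam search over CTC logits. A minimal sketch follows, reusing input_data from the previous step; 'your-finetuned-ctc-checkpoint' and 'vi_lm.bin' are hypothetical placeholders for a CTC fine-tuned version of this model and a Vietnamese KenLM binary, neither of which is named in this card:

  import torch
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
  from pyctcdecode import build_ctcdecoder

  # Hypothetical fine-tuned ASR checkpoint and KenLM binary (placeholders)
  ctc_model = Wav2Vec2ForCTC.from_pretrained('your-finetuned-ctc-checkpoint')
  ctc_processor = Wav2Vec2Processor.from_pretrained('your-finetuned-ctc-checkpoint')

  # Vocabulary in id order, since pyctcdecode expects labels indexed like the logits
  vocab = [tok for tok, _ in sorted(ctc_processor.tokenizer.get_vocab().items(),
                                    key=lambda kv: kv[1])]
  decoder = build_ctcdecoder(vocab, kenlm_model_path='vi_lm.bin')

  with torch.no_grad():
      logits = ctc_model(input_data.input_values).logits[0].cpu().numpy()
  print(decoder.decode(logits))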
Cloud GPUs
For more efficient training and inference, consider using cloud-based GPUs from providers like AWS, GCP, or Azure.
License
The WAV2VEC2-LARGE-VI model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0), which allows for non-commercial use with attribution.