viwav2vec2-base-3k
Introduction
The viwav2vec2-base-3k model, published by dragonSwing, is a Wav2Vec2 base model pre-trained on a large Vietnamese speech corpus. It is designed for Automatic Speech Recognition tasks and is built using the PyTorch framework. The model is pre-trained on 3,000 hours of Vietnamese audio data, including spontaneous, reading, and broadcasting speech.
Architecture
The model is based on the Wav2Vec2 architecture, which learns speech representations directly from raw audio. It was pre-trained on 16 kHz sampled audio, so input speech must also be sampled at 16 kHz, and it is intended to be fine-tuned for downstream tasks such as Vietnamese Automatic Speech Recognition.
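As a minimal preprocessing sketch, the snippet below resamples arbitrary audio to 16 kHz and normalizes it before a forward pass. It assumes the checkpoint may not ship a preprocessor config, so a Wav2Vec2FeatureExtractor is constructed by hand with Wav2Vec2 base defaults; sample.wav is a placeholder file name.

  import torch
  import torchaudio
  from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

  # Load any mono waveform; "sample.wav" is a placeholder.
  waveform, sample_rate = torchaudio.load("sample.wav")

  # Resample to the 16 kHz rate the model was pre-trained on.
  if sample_rate != 16_000:
      waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

  # Constructed by hand with Wav2Vec2 base defaults, in case the
  # checkpoint does not include a preprocessor config.
  feature_extractor = Wav2Vec2FeatureExtractor(
      feature_size=1, sampling_rate=16_000, padding_value=0.0,
      do_normalize=True, return_attention_mask=False,
  )

  inputs = feature_extractor(
      waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
  )
  model = Wav2Vec2Model.from_pretrained("dragonSwing/viwav2vec2-base-3k")
  with torch.no_grad():
      hidden = model(inputs.input_values).last_hidden_state  # (batch, frames, 768)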
Training
The model is pre-trained on 3,000 hours of Vietnamese speech data. Because the checkpoint does not include a tokenizer, it must be fine-tuned on labeled text data before it can be used for speech recognition. In practice this means building a character-level tokenizer from your transcripts and training a task-specific head (typically a CTC head) on top of the pre-trained encoder.
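As a concrete illustration, the sketch below builds a character-level CTC tokenizer from a hypothetical vocab.json and attaches a freshly initialized CTC head to the pre-trained encoder. The vocabulary shown is a truncated placeholder, not the full Vietnamese alphabet.

  import json
  from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

  # Hypothetical character vocabulary built from your labeled transcripts.
  vocab = {"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4, "c": 5}  # truncated
  with open("vocab.json", "w", encoding="utf-8") as f:
      json.dump(vocab, f, ensure_ascii=False)

  tokenizer = Wav2Vec2CTCTokenizer(
      "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
  )

  # Attach a randomly initialized CTC head sized to the new vocabulary;
  # the encoder weights come from the pre-trained checkpoint.
  model = Wav2Vec2ForCTC.from_pretrained(
      "dragonSwing/viwav2vec2-base-3k",
      vocab_size=len(tokenizer),
      pad_token_id=tokenizer.pad_token_id,
      ctc_loss_reduction="mean",
  )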
Guide: Running Locally
To run the viwav2vec2-base-3k model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch and Transformers installed:
  pip install torch transformers
- Load the Model:
  import torch
  from transformers import Wav2Vec2Model

  model = Wav2Vec2Model.from_pretrained("dragonSwing/viwav2vec2-base-3k")
- Sanity Check: Run one second of dummy 16 kHz audio (16,000 samples) through the model:
  inputs = torch.rand([1, 16000])  # batch of one, one second at 16 kHz
  outputs = model(inputs)
- Fine-Tuning: Refer to this notebook for detailed steps on fine-tuning the model; a minimal inference sketch for an already fine-tuned checkpoint follows this list.
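Once fine-tuning has produced a checkpoint with a tokenizer, transcription reduces to a forward pass plus greedy CTC decoding. The sketch below is illustrative only: path/to/finetuned-checkpoint is a placeholder, and a random waveform stands in for real 16 kHz speech.

  import torch
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

  # Placeholder path to your own fine-tuned checkpoint.
  checkpoint = "path/to/finetuned-checkpoint"
  processor = Wav2Vec2Processor.from_pretrained(checkpoint)
  model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
  model.eval()

  # Dummy one-second 16 kHz waveform standing in for real speech.
  speech = torch.rand(16_000).numpy()
  inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
  with torch.no_grad():
      logits = model(inputs.input_values).logits

  # Greedy CTC decoding: best token per frame, then collapse repeats and blanks.
  predicted_ids = torch.argmax(logits, dim=-1)
  print(processor.batch_decode(predicted_ids)[0])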
Recommendation: For efficient training and fine-tuning, consider using cloud GPUs such as those offered by Google Cloud, AWS, or Azure.
License
The viwav2vec2-base-3k model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0).