wav2vec2 large xlsr 53 vietnamese
anuragshas
Introduction
The wav2vec2-large-xlsr-53-vietnamese model is an automatic speech recognition (ASR) model fine-tuned for Vietnamese. It is based on the facebook/wav2vec2-large-xlsr-53 model and was fine-tuned on the Common Voice dataset.
Architecture
The model uses the Wav2Vec2 architecture, which learns speech representations from unlabeled audio through self-supervised pretraining and is then fine-tuned on transcribed speech. It is implemented in PyTorch and is compatible with the Transformers library. This checkpoint is fine-tuned for Vietnamese and expects input audio sampled at 16 kHz.
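To illustrate why the 16 kHz input rate matters, the sketch below estimates how many feature frames the convolutional encoder produces for a clip of a given length. The kernel and stride values are an assumption taken from the default Wav2Vec2 configuration in Transformers, not from this model card, and `num_output_frames` is a hypothetical helper name.

```python
# Assumed defaults of the Wav2Vec2 convolutional feature encoder
# (kernel sizes and strides of its seven 1-D convolution layers).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

SAMPLE_RATE = 16_000  # input rate stated for this model


def num_output_frames(num_samples: int) -> int:
    """Return the number of encoder frames for a clip of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def frame_stride_ms() -> float:
    """Time step between consecutive frames, in milliseconds."""
    total_stride = 1
    for stride in CONV_STRIDES:
        total_stride *= stride
    return 1000 * total_stride / SAMPLE_RATE


one_second = num_output_frames(SAMPLE_RATE)
```

Under these assumptions, one second of 16 kHz audio yields 49 frames spaced 20 ms apart. Audio at another sample rate would change the time span each frame covers, which is why clips are resampled before inference.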
Training
The model was fine-tuned on the Vietnamese portion of the Common Voice dataset, using both the train and validation subsets. Performance is measured with the Word Error Rate (WER), where lower is better; the reported test result is 66.78%.
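WER is the number of word-level edits (substitutions, insertions, deletions) needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition. The function names are illustrative, and this is not the evaluation script that produced the reported figure.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two word lists."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        hyp_words = hypothesis.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

For example, `word_error_rate(["xin chào các bạn"], ["xin chào bạn"])` returns 0.25: one deleted word out of four reference words. By the same definition, a WER of 66.78% corresponds to roughly 67 word edits per 100 reference words.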
Guide: Running Locally
To run the model locally, follow these steps:
- Environment Setup: Ensure Python is installed and set up a virtual environment. Install the necessary libraries:
    pip install torch torchaudio transformers datasets
- Model and Dataset Loading:
    import torch
    import torchaudio
    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Load a small slice of the Vietnamese Common Voice test split
    test_dataset = load_dataset("common_voice", "vi", split="test[:2%]")

    # Load the processor (feature extractor + tokenizer) and the fine-tuned model
    processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-vietnamese")
    model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-vietnamese")
- Preprocessing and Inference:
  - Resample each audio clip to 16 kHz, the rate the model expects, and pass it through the processor.
  - Run the model on the processed inputs and decode the predicted token IDs into text.
- Evaluation:
  - Compute the WER on the test data to validate performance.
Inference is faster on a GPU. If none is available locally, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.
License
The model is released under the Apache 2.0 License, which permits commercial use, distribution, modification, and private use, provided that the license and copyright notices are retained and significant changes are stated.