wav2vec2-base-vietnamese-250h (nguyenvulebinh)
Introduction
The WAV2VEC2-BASE-VIETNAMESE-250H model is an automatic speech recognition (ASR) system for the Vietnamese language. It uses the Wav2Vec 2.0 architecture, which learns speech representations from unlabeled audio and is then fine-tuned on transcribed speech data.
Architecture
The model is pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the VLSP ASR dataset. During pre-training, Wav2Vec 2.0 learns contextualized representations directly from raw audio in a self-supervised fashion; fine-tuning then trains the network with Connectionist Temporal Classification (CTC), an alignment-free sequence loss widely used in ASR.
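To illustrate what the CTC objective looks like in practice, here is a minimal sketch using PyTorch's built-in torch.nn.CTCLoss. All shapes, the vocabulary size, and the target sequences below are made up for illustration; they are not taken from this model.

```python
import torch
import torch.nn as nn

# Toy setup: T=50 encoder frames, batch of 2, vocabulary of 30 tokens
# (index 0 reserved as the CTC blank). All values are illustrative.
T, N, C = 50, 2, 30
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # (time, batch, vocab)

# Two target transcripts of different lengths, as token-id sequences.
targets = torch.randint(low=1, high=C, size=(N, 12))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([12, 9], dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the frame-level
# predictions and the (shorter) target sequence, so no frame-level
# annotation is needed.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```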
Training
- Pre-training Data: 13,000 hours of unlabeled YouTube audio.
- Fine-tuning Data: 250 hours of labeled VLSP ASR dataset audio.
- Model Parameters: Approximately 95 million.
- Language Model: A 4-gram language model trained on 2 GB of spoken text is used to improve accuracy.
- WER Results (%, lower is better):

| Decoding       | VIVOS | Common Voice VI | VLSP-T1 | VLSP-T2 |
|----------------|-------|-----------------|---------|---------|
| Without LM     | 10.77 | 18.34           | 13.33   | 51.45   |
| With 4-gram LM | 6.15  | 11.52           | 9.11    | 40.81   |
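The model card does not specify the decoding stack used to apply the 4-gram LM. One common way to combine a KenLM n-gram model with wav2vec2 CTC logits is beam-search decoding via pyctcdecode; the sketch below assumes that stack, and the path "lm/vi_4gram.bin" is a placeholder, not a file from the model repository.

```python
# Hedged sketch: LM-boosted CTC decoding with pyctcdecode + KenLM.
import numpy as np
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# Order the tokenizer vocabulary by token id so the labels line up
# with the columns of the CTC logit matrix.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="lm/vi_4gram.bin")

def lm_decode(logits: np.ndarray) -> str:
    # `logits` is a (time, vocab_size) array for one utterance, e.g.
    # model(input_values).logits[0].detach().numpy() from the guide below.
    return decoder.decode(logits)
```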
Guide: Running Locally
Requirements:
- Ensure audio input is sampled at 16 kHz (see the resampling sketch below).
- Audio clips should be shorter than 10 seconds.
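If your recordings are not already 16 kHz, resample them first. A minimal sketch using librosa follows; note that librosa is not among the libraries the guide installs, and "input.wav" is a placeholder filename.

```python
import librosa
import soundfile as sf

# Load at the file's native rate, then resample to the 16 kHz the model expects.
speech, sr = librosa.load("input.wav", sr=None)
speech_16k = librosa.resample(speech, orig_sr=sr, target_sr=16000)
sf.write("input_16k.wav", speech_16k, 16000)
```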
Setup:
- Install the necessary libraries: transformers, datasets, and soundfile.
- Load the model and processor using the transformers library:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
```
- Prepare and tokenize the audio:

```python
import soundfile as sf
import torch

def map_to_array(batch):
    # Read the waveform from disk (expects 16 kHz audio).
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = map_to_array({"file": "audio-test/t1_0001-00010.wav"})
input_values = processor(ds["speech"], sampling_rate=16000,
                         return_tensors="pt", padding="longest").input_values
```
- Obtain and decode the output:

```python
# Run inference without tracking gradients, then greedy-decode the logits.
with torch.no_grad():
    logits = model(input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
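Putting the steps together, here is a small end-to-end helper; the function name transcribe is just for illustration and is not part of the model's API.

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

def transcribe(path: str) -> str:
    """Transcribe a single 16 kHz WAV file shorter than ~10 seconds."""
    speech, _ = sf.read(path)
    input_values = processor(speech, sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]

print(transcribe("audio-test/t1_0001-00010.wav"))
```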
Cloud GPU Recommendation:
- For enhanced performance, it is recommended to use cloud-based GPUs like those offered by AWS, Google Cloud, or Azure.
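To actually run on a GPU, move both the model and the inputs onto the device. This is the standard PyTorch pattern, not something specific to this model; it continues from the `model` and `input_values` defined in the setup above.

```python
import torch

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

with torch.no_grad():
    logits = model(input_values.to(device)).logits
```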
License
The model is available under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which permits use, sharing, and adaptation for non-commercial purposes only, with attribution. The full license text is available at https://creativecommons.org/licenses/by-nc/4.0/.