wav2vec2 large xlsr 53 vietnamese

anuragshas

Introduction

The wav2vec2-large-xlsr-53-vietnamese model performs automatic speech recognition (ASR) for Vietnamese. It is a fine-tuned version of facebook/wav2vec2-large-xlsr-53, trained on the Vietnamese portion of the Common Voice dataset.

Architecture

The model uses the Wav2Vec2 architecture, which learns speech representations from unlabeled audio through self-supervised pretraining. It is implemented in PyTorch and loads through the Transformers library. The fine-tuned model targets Vietnamese and expects input audio sampled at 16 kHz.

Training

The model was fine-tuned on the Vietnamese portion of the Common Voice dataset, using both the train and validation subsets. Performance is reported as Word Error Rate (WER), where lower is better; the test result is 66.78%.
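
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcription into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition. It is not the script used to produce the 66.78% figure, the function names `word_edit_distance` and `wer` are hypothetical, and no text normalization is applied beyond whitespace splitting.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        errors += word_edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return errors / total_words


# One substitution ("chào" -> "chao") and one deletion ("bạn") over 4 reference words.
print(wer(["xin chào các bạn"], ["xin chao các"]))  # 0.5
```

Errors are summed over the whole corpus before dividing, so long sentences weigh more than short ones. Because insertions count as errors, WER can exceed 100%.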

Guide: Running Locally

To run the model locally, follow these steps:

  1. Environment Setup: Ensure you have Python installed, and set up a virtual environment. Install necessary libraries:

    pip install torch torchaudio transformers datasets
    
  2. Model and Dataset Loading:

    import torch
    import torchaudio
    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    test_dataset = load_dataset("common_voice", "vi", split="test[:2%]")
    processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-vietnamese")
    model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-vietnamese")
    
  3. Preprocessing and Inference:

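    • Load each audio clip, resample it to 16 kHz, and convert it to model inputs with the processor.
    • Run the model on those inputs and decode the predicted token IDs to text with the processor.

  4. Evaluation:

    • Compute the WER between the model's transcriptions and the reference sentences of the test split.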
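
The preprocessing and inference step can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the author's evaluation script: the helper names `load_audio` and `transcribe` are hypothetical, the clips are assumed to be local audio files that torchaudio can read, and decoding is plain greedy (argmax) CTC decoding without a language model.

```python
MODEL_ID = "anuragshas/wav2vec2-large-xlsr-53-vietnamese"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def load_audio(path):
    """Load one clip as a mono NumPy array sampled at 16 kHz."""
    # Imported inside the function so the heavy dependency is only
    # needed when audio is actually read.
    import torchaudio

    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, frames)
    waveform = waveform.mean(dim=0)                # downmix to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )
    return waveform.numpy()


def transcribe(paths, model_id=MODEL_ID):
    """Return one transcription per audio file path (hypothetical helper)."""
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    speech = [load_audio(path) for path in paths]
    # Pad the clips to a common length and build the attention mask.
    inputs = processor(
        speech,
        sampling_rate=TARGET_SAMPLE_RATE,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Greedy CTC decoding: most likely token per frame, then collapse to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

Calling `transcribe(["clip.wav"])` with a placeholder file name downloads the model on first use and returns a list with one string per clip. For evaluation, the returned strings are compared against the reference sentences with WER.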

For efficient processing, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The model is released under the Apache 2.0 License, which permits commercial use, distribution, modification, and private use, provided that the license and copyright notices are retained and significant changes are stated.
