wav2vec2-large-xlsr-persian-v3

m3hrdadfi

Introduction

This document provides an overview of wav2vec2-large-xlsr-persian-v3, a model for Automatic Speech Recognition (ASR) in Persian built on the Wav2Vec2 architecture. It is fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint using the Common Voice dataset.

Architecture

The model leverages the Wav2Vec2 architecture, specifically the large variant wav2vec2-large-xlsr-53, which is well-suited for cross-lingual speech recognition tasks. It is built using the Transformers library and supports integration with both PyTorch and TensorFlow.

Training

The model is fine-tuned on Persian speech data from the Common Voice dataset. The fine-tuning process involves adjusting the pre-trained Wav2Vec2 model to better recognize and transcribe Persian speech, achieving a Word Error Rate (WER) of 10.36% on the test set.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Packages:

    pip install git+https://github.com/huggingface/datasets.git
    pip install git+https://github.com/huggingface/transformers.git
    pip install torchaudio librosa jiwer parsivar num2fawords
    
  2. Download and Prepare Data:
    Download the Common Voice dataset for Persian and extract it:

    wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/fa.tar.gz
    tar -xzf fa.tar.gz
    rm -rf fa.tar.gz
    
  3. Data Cleaning:
    Use the provided normalizer script to clean the data:

    from normalizer import normalizer
    # Define a cleaning function and apply it to your dataset
    
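The normalizer script referenced in step 3 ships with the model repository. If it is not at hand, a minimal stand-in cleaner might look like the sketch below; this is a plain regex-based assumption, not the repository's actual normalization logic (which also handles Persian digits and character variants):

```python
import re

def clean_sentence(text: str) -> str:
    """Hypothetical minimal cleaner: strip common punctuation, collapse whitespace."""
    text = re.sub(r"[\"'.,!?؟،؛:;()\[\]«»]", " ", text)  # drop Latin and Persian punctuation
    return re.sub(r"\s+", " ", text).strip()             # collapse runs of whitespace

print(clean_sentence("سلام، دنیا!"))  # → سلام دنیا
```

Whatever cleaner you use, apply the same normalization to both the training transcripts and the references used for evaluation, so the WER comparison is fair.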
  4. Load and Prepare the Model:

    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name_or_path = "m3hrdadfi/wav2vec2-large-xlsr-persian-v3"
    processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
    model = Wav2Vec2ForCTC.from_pretrained(model_name_or_path).to(device)
    
  5. Make Predictions:
    Use the model to predict transcriptions from audio files.
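In practice, prediction means feeding the 16 kHz waveform through the processor, taking the argmax of the model's per-frame logits, and calling processor.batch_decode, which collapses the frame predictions CTC-style (merge consecutive repeats, drop blanks). A toy illustration of that collapse step, using a hypothetical four-character vocabulary:

```python
def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Hypothetical tiny vocabulary; the real processor maps ids to Persian characters.
vocab = {1: "س", 2: "ل", 3: "ا", 4: "م"}
frames = [1, 1, 0, 2, 2, 3, 0, 4]  # per-frame argmax ids from the logits
print(ctc_greedy_decode(frames, vocab))  # → سلام
```

Note that a blank between two identical ids preserves a genuine double letter, which is why CTC uses the blank token at all.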

  6. Evaluate the Model:
    Calculate the WER to evaluate the performance of the model.
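The jiwer package installed in step 1 provides a wer() function for this. Conceptually, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; a minimal pure-Python equivalent, for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (rolling row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution or match
            prev = cur
    return d[-1] / len(ref)

print(wer("a b c", "a x c"))  # one substitution out of three words → 0.333...
```

Normalize both strings with the same cleaning function before scoring, since mismatched punctuation or digit forms would otherwise inflate the error rate.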

Cloud GPUs: Consider using cloud-based services like AWS, Google Cloud, or Azure for access to powerful GPUs suitable for model inference and training.

License

The license for the model and its associated code is stated on the model's repository page; check there for the specific terms and conditions.
