wav2vec2-large-xlsr-persian-v3

m3hrdadfi

Introduction

This document provides an overview of wav2vec2-large-xlsr-persian-v3, a model for Automatic Speech Recognition (ASR) in Persian built on the Wav2Vec2 architecture. It is fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint using the Common Voice dataset.

Architecture

The model leverages the Wav2Vec2 architecture, specifically the large variant wav2vec2-large-xlsr-53, which is well-suited for cross-lingual speech recognition tasks. It is built using the Transformers library and supports integration with both PyTorch and TensorFlow.

Training

The model is fine-tuned on Persian speech data from the Common Voice dataset. The fine-tuning process involves adjusting the pre-trained Wav2Vec2 model to better recognize and transcribe Persian speech, achieving a Word Error Rate (WER) of 10.36% on the test set.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Packages:

    pip install git+https://github.com/huggingface/datasets.git
    pip install git+https://github.com/huggingface/transformers.git
    pip install torchaudio librosa jiwer parsivar num2fawords
    
  2. Download and Prepare Data:
    Download the Common Voice dataset for Persian and extract it:

    wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/fa.tar.gz
    tar -xzf fa.tar.gz
    rm -rf fa.tar.gz
    
  3. Data Cleaning:
    Use the provided normalizer script to clean the data:

    from normalizer import normalizer
    # Define a cleaning function and apply it to your dataset
    
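The normalizer script referenced in step 3 ships with the model repository. If it is not at hand, a minimal stand-in cleaner might look like the sketch below; this is a plain regex-based assumption, not the repository's actual normalization logic (which also handles Persian digits and character variants):

```python
import re

def clean_sentence(text: str) -> str:
    """Hypothetical minimal cleaner: strip common punctuation, collapse whitespace."""
    text = re.sub(r"[\"'.,!?؟،؛:;()\[\]«»]", " ", text)  # drop Latin and Persian punctuation
    return re.sub(r"\s+", " ", text).strip()             # collapse runs of whitespace

print(clean_sentence("سلام، دنیا!"))  # → سلام دنیا
```

Whatever cleaner you use, apply the same normalization to both the training transcripts and the references used for evaluation, so the WER comparison is fair.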
  4. Load and Prepare the Model:

    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name_or_path = "m3hrdadfi/wav2vec2-large-xlsr-persian-v3"
    processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
    model = Wav2Vec2ForCTC.from_pretrained(model_name_or_path).to(device)
    
  5. Make Predictions:
    Use the model to predict transcriptions from audio files.
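In practice, prediction means feeding the 16 kHz waveform through the processor, taking the argmax of the model's per-frame logits, and calling processor.batch_decode, which collapses the frame predictions CTC-style (merge consecutive repeats, drop blanks). A toy illustration of that collapse step, using a hypothetical four-character vocabulary:

```python
def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Hypothetical tiny vocabulary; the real processor maps ids to Persian characters.
vocab = {1: "س", 2: "ل", 3: "ا", 4: "م"}
frames = [1, 1, 0, 2, 2, 3, 0, 4]  # per-frame argmax ids from the logits
print(ctc_greedy_decode(frames, vocab))  # → سلام
```

Note that a blank between two identical ids preserves a genuine double letter, which is why CTC uses the blank token at all.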

  6. Evaluate the Model:
    Calculate the WER to evaluate the performance of the model.
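The jiwer package installed in step 1 provides a wer() function for this. Conceptually, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; a minimal pure-Python equivalent, for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (rolling row).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                            # deletion
                       d[j - 1] + 1,                        # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))   # substitution or match
            prev = cur
    return d[-1] / len(ref)

print(wer("a b c", "a x c"))  # one substitution out of three words → 0.333...
```

Normalize both strings with the same cleaning function before scoring, since mismatched punctuation or digit forms would otherwise inflate the error rate.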

Cloud GPUs: Consider using cloud-based services like AWS, Google Cloud, or Azure for access to powerful GPUs suitable for model inference and training.

License

The license for the model and its associated code is stated on the model's repository page; check there for the specific terms and conditions.
