wav2vec2 large xlsr 53 persian
jonatasgrosman
Introduction
The wav2vec2-large-xlsr-53-persian model is a fine-tuned version of Facebook's Wav2Vec2-Large-XLSR-53 for Persian automatic speech recognition. It was fine-tuned on the Common Voice 6.1 dataset and expects speech input sampled at 16 kHz.
Architecture
The model uses the Wav2Vec2 architecture, specifically the large XLSR-53 variant, a multilingual model pretrained on speech from 53 languages. For speech recognition it is loaded with a CTC output head (Wav2Vec2ForCTC).
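Because the model is loaded through Wav2Vec2ForCTC, it emits one label prediction per audio frame, and a CTC decoding step turns those frame labels into text. The following is a minimal sketch of greedy CTC decoding, not the library's implementation. The vocabulary, the blank id, and the frame sequence are made up for illustration; the real model ships its own vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    # Step 1: collapse runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove the blank label, which only separates real characters.
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary (assumption: id 0 is the CTC blank).
toy_vocab = {0: "<blank>", 1: "س", 2: "ل", 3: "ا", 4: "م"}

# Hypothetical per-frame argmax ids, with repeats and blanks.
frames = [1, 1, 0, 2, 3, 3, 0, 0, 4]

print(ctc_greedy_decode(frames, toy_vocab))
```

A blank between two identical labels keeps them as two separate characters, which is how CTC represents doubled letters.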
Training
This model was fine-tuned on the Persian language using the train and validation splits of the Common Voice 6.1 dataset. The computing resources for training were provided by OVHcloud. The training script can be found on GitHub.
Guide: Running Locally
- Install Dependencies: Ensure Python and the required libraries such as torch, librosa, transformers, and datasets are installed.
- Load Model and Processor:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
    model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
- Prepare Audio Data: Load your audio files as arrays using librosa.
- Run Inference: Process and transcribe the audio data using the model.
- Evaluate: Use metrics like Word Error Rate (WER) and Character Error Rate (CER) to evaluate performance.
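For the evaluation step, both metrics reduce to an edit distance divided by the reference length: WER counts word-level edits, CER counts character-level edits. Below is a minimal pure-Python sketch under that definition; the helper names are illustrative, the example sentences are made up, and this CER counts spaces as characters, which may differ from a given evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word Error Rate: word-level edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / sum(len(r.split()) for r in references)


def cer(references, hypotheses):
    """Character Error Rate: character-level edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)


# Made-up example: the hypothesis drops the last letter of the last word.
refs = ["من به خانه رفتم"]
hyps = ["من به خانه رفت"]
print(f"WER: {wer(refs, hyps):.2%}  CER: {cer(refs, hyps):.2%}")
```

In practice a dedicated library such as jiwer is often used instead of hand-written helpers; the sketch is only meant to show what the two numbers measure.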
For faster inference, consider running the model on a GPU, for example a cloud GPU instance from a provider such as AWS, Google Cloud, or OVHcloud.
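The audio preparation and inference steps, with the model placed on a GPU when one is available, might look like the sketch below. The function name `transcribe` and its parameters are illustrative assumptions, not part of the model's published API; the heavy libraries are imported inside the function so that merely defining it does not require them.

```python
def transcribe(paths,
               model_id="jonatasgrosman/wav2vec2-large-xlsr-53-persian",
               sampling_rate=16_000):
    """Transcribe audio files and return one string per input path."""
    # Deferred imports: torch, librosa, and transformers must be installed.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Use a GPU if PyTorch can see one, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device).eval()

    # librosa resamples each file to the requested rate while loading.
    speech = [librosa.load(path, sr=sampling_rate)[0] for path in paths]

    # Pad the batch to a common length and build the attention mask.
    inputs = processor(speech, sampling_rate=sampling_rate,
                       return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(
            inputs.input_values.to(device),
            attention_mask=inputs.attention_mask.to(device),
        ).logits

    # Greedy decoding: most likely label per frame, then CTC collapse.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

Calling `transcribe(["/path/to/file.wav"])` (a placeholder path) downloads the model on first use and returns a list with one transcription.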
License
The model is licensed under the Apache 2.0 License, which allows for both personal and commercial use, modification, and distribution.