wav2vec2-large-xlsr-53-persian

jonatasgrosman

Introduction

The wav2vec2-large-xlsr-53-persian model is a fine-tuned version of Facebook's Wav2Vec2-Large-XLSR-53 for Persian automatic speech recognition. It was fine-tuned on the Common Voice 6.1 dataset and expects speech input sampled at 16 kHz.
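Because the model expects 16 kHz input, it can help to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and are not part of the model's API.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate the model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

A file for which `needs_resampling` returns True should be resampled to 16 kHz before it is passed to the model.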

Architecture

The model is based on the Wav2Vec2 architecture, specifically the large XLSR-53 variant, a multilingual model pretrained on speech from 53 languages that can be fine-tuned for speech recognition.

Training

This model was fine-tuned on the Persian language using the train and validation splits of the Common Voice 6.1 dataset. The computing resources for training were provided by OVHcloud. The training script can be found on GitHub.

Guide: Running Locally

  1. Install Dependencies: Ensure Python and the required libraries such as torch, librosa, transformers, and datasets are installed.
  2. Load Model and Processor:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
    model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
    
  3. Prepare Audio Data: Load your audio files as arrays using librosa, resampling them to 16 kHz.
  4. Run Inference: Process and transcribe the audio data using the model.
  5. Evaluate: Use metrics like Word Error Rate (WER) and Character Error Rate (CER) to evaluate performance.
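Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration and not the author's evaluation script. It assumes greedy (argmax) CTC decoding and that the processor returns an attention mask. The helpers `transcribe`, `edit_distance`, and `error_rate` are hypothetical names introduced here, and the error rates are computed with a plain Levenshtein distance rather than a dedicated metrics library.

```python
from typing import List, Sequence

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(paths: List[str]) -> List[str]:
    """Transcribe audio files using greedy CTC decoding."""
    # Heavy dependencies are imported lazily so the metric helpers
    # below can be used without them installed.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa resamples to the requested rate while loading.
    speech = [librosa.load(path, sr=SAMPLE_RATE)[0] for path in paths]
    inputs = processor(speech, sampling_rate=SAMPLE_RATE,
                       return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values,
                       attention_mask=inputs.attention_mask).logits

    # Pick the most likely token per frame; batch_decode collapses
    # repeated tokens and removes CTC blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references: List[str], hypotheses: List[str],
               unit: str = "word") -> float:
    """Corpus-level WER (unit="word") or CER (unit="char")."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no units to score")
    return errors / total
```

With your own audio paths and reference transcripts, `error_rate(references, transcribe(paths))` gives the WER and `error_rate(references, transcribe(paths), unit="char")` gives the CER. Reported scores also depend on text normalization (punctuation, casing, character variants), which this sketch does not perform.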

For faster inference, consider running the model on a GPU, for example a cloud instance from providers like AWS, Google Cloud, or OVHcloud.

License

The model is licensed under the Apache 2.0 License, which allows for both personal and commercial use, modification, and distribution.
