wav2vec2 large xlsr 53 arabic

elgeish

Introduction

wav2vec2-large-xlsr-53-arabic is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned for Arabic automatic speech recognition (ASR). It was fine-tuned on the Common Voice and Arabic Speech Corpus datasets and expects audio sampled at 16 kHz. Its accuracy is reported as Word Error Rate (WER).

Architecture

The model is built on the Wav2Vec2 architecture, specifically the large XLSR-53 variant, a transformer-based network that takes raw audio as input. A CTC head (Wav2Vec2ForCTC) on top of it turns the audio representation into text. Fine-tuning adapts the model to Arabic, and the model represents Arabic text in the Buckwalter transliteration format, so its raw output is ASCII rather than Arabic script.
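To make the Buckwalter format concrete, here is a minimal sketch that maps transliterated text back to Arabic script. It assumes the standard Buckwalter table; whether the model's vocabulary follows that table character for character is an assumption, and the names `BUCKWALTER_TO_ARABIC` and `buckwalter_to_arabic` are hypothetical, not part of the model's code.

```python
# Standard Buckwalter symbols, listed in Unicode order of the Arabic letters
# they stand for. Assumption: the model's output uses this same table.
_HAMZA_TO_GHAIN = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 .. U+063A
_TATWEEL_TO_SUKUN = "_fqklmnhwYyFNKaui~o"        # U+0640 .. U+0652

BUCKWALTER_TO_ARABIC = {}
for offset, symbol in enumerate(_HAMZA_TO_GHAIN):
    BUCKWALTER_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(_TATWEEL_TO_SUKUN):
    BUCKWALTER_TO_ARABIC[symbol] = chr(0x0640 + offset)
BUCKWALTER_TO_ARABIC["`"] = "\u0670"  # superscript (dagger) alef
BUCKWALTER_TO_ARABIC["{"] = "\u0671"  # alef wasla


def buckwalter_to_arabic(text):
    """Replace each Buckwalter symbol with its Arabic character.

    Characters outside the table (spaces, digits) pass through unchanged.
    """
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


# "ktAb" is the Buckwalter spelling of the Arabic word for "book".
print(buckwalter_to_arabic("ktAb"))
```

A table lookup is enough here because Buckwalter is a one-to-one mapping between ASCII symbols and Arabic characters, so no context is needed to reverse it.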

Training

Training used two datasets: the Arabic Speech Corpus and Common Voice. The model was first fine-tuned on the Arabic Speech Corpus and then fine-tuned further on Common Voice. Model selection was based on the test and validation splits of these datasets. After 100,000 training steps, the model reached a WER of 23.39% on the validation split.
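For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that definition for a single utterance. The function name `word_error_rate` is hypothetical, and this is not the evaluation script behind the figure above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1       # reference word missing from output
            insertion = current_row[j - 1] + 1   # extra word in output
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word and one dropped word out of four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Note that a score over a whole test set is normally computed by summing errors and reference words across all utterances, not by averaging per-utterance scores.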

Guide: Running Locally

  1. Set Up Environment: Install Python along with the torch, torchaudio, and transformers libraries.
  2. Load Data: Load the test split of the Common Voice dataset for evaluation.
  3. Prepare Audio: Resample any audio that is not already at 16kHz, the rate the model expects.
  4. Initialize Model and Processor: Load the fine-tuned model and its processor using:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic")
    model = Wav2Vec2ForCTC.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic").eval()
    
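  5. Inference: Pass the resampled audio through the processor and the model, take the most likely token for each frame, and decode the token IDs into text with the processor.
  6. Evaluate: Compare the decoded text against the reference transcripts using WER.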
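Steps 3 to 5 can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's evaluation script: `load_model`, `transcribe`, and `greedy_ctc_decode` are hypothetical names, the audio is mixed down to mono before resampling, and the toy vocabulary at the end is invented for illustration. `greedy_ctc_decode` shows the rule that CTC decoding applies: merge repeated frame predictions, then drop the blank token.

```python
MODEL_ID = "elgeish/wav2vec2-large-xlsr-53-arabic"
TARGET_RATE = 16_000  # the model expects 16 kHz audio


def greedy_ctc_decode(frame_ids, id_to_char, blank_id):
    """Merge consecutive repeats, then drop the CTC blank token."""
    chars = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


def load_model():
    """Download (or load from cache) the processor and the model."""
    # Imported lazily so the decoding helper above runs without these packages.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()
    return processor, model


def transcribe(path, processor, model):
    """Return the model's transcript for one audio file."""
    import torch
    import torchaudio

    waveform, rate = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)         # assumption: mix down to mono
    if rate != TARGET_RATE:
        waveform = torchaudio.functional.resample(waveform, rate, TARGET_RATE)

    inputs = processor(
        waveform.numpy(), sampling_rate=TARGET_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    predicted_ids = torch.argmax(logits, dim=-1)    # most likely token per frame
    # batch_decode merges repeats and removes padding/blank tokens,
    # the same rule greedy_ctc_decode illustrates.
    return processor.batch_decode(predicted_ids)[0]


# Toy illustration of the decoding rule with an invented vocabulary (id 0 = blank).
toy_vocab = {0: "", 1: "k", 2: "t", 3: "A", 4: "b"}
print(greedy_ctc_decode([1, 1, 0, 2, 3, 3, 0, 0, 4], toy_vocab, blank_id=0))  # ktAb
```

Calling `processor, model = load_model()` once and then `transcribe("clip.wav", processor, model)` for each file (the file name is a placeholder) avoids reloading the weights for every clip. The returned text is in Buckwalter transliteration and needs to be mapped back to Arabic script before display.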

Cloud GPUs: For efficient computation, consider using cloud services like AWS EC2 with GPU instances, Google Cloud GPU instances, or Azure GPU VMs.

License

The model is licensed under the Apache 2.0 License, allowing for broad usage with minimal restrictions.
