wav2vec2-large-xlsr-53-arabic

jonatasgrosman

Introduction

The wav2vec2-large-xlsr-53-arabic model is a fine-tuned version of Facebook's wav2vec2-large-xlsr-53 for Arabic speech recognition. It was trained and validated on the Common Voice 6.1 and Arabic Speech Corpus datasets. The model expects speech input sampled at 16kHz and is designed for automatic speech recognition (ASR) tasks. Fine-tuning was made possible by GPU credits donated by OVHcloud.

Architecture

This model is built on the Wav2Vec 2.0 architecture, specifically the wav2vec2-large-xlsr-53 variant, which is designed for multilingual speech recognition. It maps input audio to transcriptions via CTC decoding, leveraging pre-training on a large multilingual corpus and fine-tuning on Arabic.
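
For the quickest use, the checkpoint can be driven through the transformers pipeline API. This is a minimal sketch, not the author's published usage example; the audio path is a placeholder, and decoding a file this way requires ffmpeg or soundfile to be available.

    from transformers import pipeline

    # High-level ASR pipeline around the fine-tuned checkpoint
    asr = pipeline("automatic-speech-recognition", model="jonatasgrosman/wav2vec2-large-xlsr-53-arabic")

    # Transcribe a local file (placeholder path); returns {"text": "..."}
    print(asr("path/to/audio.wav")["text"])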

Training

The model was fine-tuned for Arabic using the Common Voice and Arabic Speech Corpus datasets; the training script is available on GitHub. On the Common Voice test set, the model reports a Word Error Rate (WER) of 39.59% and a Character Error Rate (CER) of 18.18%.
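
These metrics can be reproduced for your own reference/hypothesis pairs with the jiwer library. This is a sketch under the assumption that jiwer's defaults match your needs; the author's exact scoring script may differ, for example in text normalization. The strings below are hypothetical examples.

    # pip install jiwer
    from jiwer import wer, cer

    # Hypothetical reference transcriptions and model outputs, for illustration only
    references = ["مرحبا بالعالم", "صباح الخير"]
    hypotheses = ["مرحبا بالعالم", "صباح الخر"]

    print(f"WER: {wer(references, hypotheses):.2%}")
    print(f"CER: {cer(references, hypotheses):.2%}")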

Guide: Running Locally

Basic Steps

  1. Install Required Libraries: Ensure you have librosa, torch, transformers, and datasets installed (datasets is used in the evaluation sketch after these steps).

    pip install librosa torch transformers datasets
    
  2. Load Pre-trained Model:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
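    # Wav2Vec2Processor bundles the feature extractor (waveform -> model inputs)
    # and the CTC tokenizer used to decode predicted ids back to text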
    processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
    model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
    
  3. Prepare Audio Data: Load your audio files and ensure they are sampled at 16kHz.

    import librosa
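    # librosa.load resamples the audio to 16 kHz (sr=16_000), the rate the model expects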
    speech_array, sampling_rate = librosa.load("path/to/audio.wav", sr=16_000)
    
  4. Transcribe Audio:

    import torch

    # Extract features and run a forward pass without gradient tracking
    inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

    # Greedy CTC decoding: pick the most likely token at each time step
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    

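The datasets library installed in step 1 can also pull labeled Common Voice audio for spot-checking the transcriptions. The sketch below assumes the mozilla-foundation/common_voice_6_1 dataset id and the ar configuration (a gated dataset, so a Hugging Face login that has accepted its terms may be required), and it reuses the processor and model loaded above.

    import torch
    from datasets import load_dataset, Audio

    # Load a small slice of the Arabic test split and resample to 16 kHz
    test = load_dataset("mozilla-foundation/common_voice_6_1", "ar", split="test[:10]")
    test = test.cast_column("audio", Audio(sampling_rate=16_000))

    for sample in test:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
        prediction = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
        print(prediction, "|", sample["sentence"])
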
Suggested Cloud GPUs

For efficient processing, consider using cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Azure. These platforms provide scalable resources ideal for handling large datasets and complex model inference tasks.

License

This model is licensed under the Apache 2.0 License, which permits use, distribution, and modification under defined conditions. Ensure compliance with the license terms when using the model in your projects.
