wav2vec2-large-xlsr-53-arabic-egyptian

arbml

Introduction

The wav2vec2-large-xlsr-53-arabic-egyptian model is an automatic speech recognition model for Egyptian Arabic, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint on the Common Voice dataset.

Architecture

This model uses the XLSR (cross-lingual speech representation) variant of Wav2Vec2, a transformer-based architecture for speech recognition. It operates directly on raw audio waveforms and produces textual transcriptions via a CTC (connectionist temporal classification) head.
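
As a quick illustration of that contract, the minimal sketch below (using the same checkpoint as the guide further down, with a one-second silent waveform standing in for real audio; this smoke test is not part of the original card) shows the raw-waveform-in, per-frame-logits-out interface:

    import torch
    from transformers import Wav2Vec2ForCTC

    # Same checkpoint as in the guide below
    model = Wav2Vec2ForCTC.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")

    # One second of 16 kHz silence as a stand-in for real audio: (batch, samples)
    waveform = torch.zeros(1, 16_000)
    with torch.no_grad():
        logits = model(waveform).logits

    # (batch, frames, vocab_size): one logit vector per ~20 ms frame,
    # which CTC decoding collapses into a character transcription
    print(logits.shape)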

Training

The model was fine-tuned on the Common Voice dataset, specifically its Egyptian Arabic subset. The documentation does not specify the training procedure or hyperparameters used.

Guide: Running Locally

Basic Steps

  1. Install Required Libraries: Ensure torch, torchaudio, datasets, and transformers are available; install them via pip if needed:

    pip install torch torchaudio datasets transformers
    
  2. Load Dataset and Model: Use the datasets library to load the Common Voice dataset and the transformers library to load the pre-trained model and processor:

    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    # Load a small slice of the Egyptian Arabic test split for a quick sanity check
    test_dataset = load_dataset("common_voice", "ar-EG", split="test[:2%]")
    processor = Wav2Vec2Processor.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
    model = Wav2Vec2ForCTC.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
    
  3. Preprocess Audio Data: Load the audio files and resample them to the 16 kHz rate the model expects:

    import torchaudio
    
    # Common Voice audio is 48 kHz; the model expects 16 kHz input
    resampler = torchaudio.transforms.Resample(48_000, 16_000)
    
    def speech_file_to_array_fn(batch):
        # Load each clip and downsample it to a 1-D 16 kHz numpy array
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch
    
    test_dataset = test_dataset.map(speech_file_to_array_fn)
    
  4. Run Inference: Process the audio inputs and decode the predictions (a WER scoring sketch follows this list):

    import torch
    
    # Batch the first two utterances, padding them to a common length
    inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
    
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    
    # Greedy CTC decoding: pick the most likely token at each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))
    print("Reference:", test_dataset["sentence"][:2])
    

Cloud GPU Suggestion

For efficient processing, consider using cloud-based GPU services like AWS EC2, Google Cloud, or Azure to handle large datasets or intensive computations.
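
If you do provision a GPU instance, the inference from step 4 can run on the device with only minor changes; a minimal sketch:

    import torch
    
    # Move the model and the padded batch onto the GPU when one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    
    with torch.no_grad():
        logits = model(
            inputs.input_values.to(device),
            attention_mask=inputs.attention_mask.to(device),
        ).logits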

License

The model and associated code are licensed under the Apache 2.0 License, allowing for both personal and commercial use.
