wav2vec2-large-xlsr-53-arabic-egyptian
Introduction
The wav2vec2-large-xlsr-53-arabic-egyptian model is an automatic speech recognition (ASR) model for Egyptian Arabic, created by fine-tuning Facebook's wav2vec2-large-xlsr-53 on the Common Voice dataset.
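For a quick end-to-end check, the model can also be driven through the `transformers` ASR pipeline, a higher-level alternative to the manual steps in the guide below. A minimal sketch; the audio path is hypothetical, and decoding a local file this way requires ffmpeg:

```python
from transformers import pipeline

# High-level pipeline wrapping the same processor and model used in the guide below.
asr = pipeline("automatic-speech-recognition", model="Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")

# "clip.wav" is a hypothetical 16 kHz mono recording of Egyptian Arabic speech.
print(asr("clip.wav")["text"])
```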
Architecture
This model uses the XLSR (cross-lingual speech representation) variant of the Wav2Vec2 architecture, a transformer-based model for speech recognition. It consumes raw audio waveforms directly, without handcrafted features, and produces textual transcriptions through a CTC head.
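To make the "audio in, text out" flow concrete, the sketch below feeds one second of raw 16 kHz samples through the model and inspects the per-frame CTC logits it produces. The silent waveform is just a placeholder input:

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
model = Wav2Vec2ForCTC.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")

# One second of (silent) raw audio at 16 kHz stands in for a real recording.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Shape (batch, frames, vocab): one distribution over output characters per audio frame.
print(logits.shape)
```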
Training
The model was fine-tuned on the Egyptian Arabic portion of the Common Voice dataset. The documentation does not specify the training procedure or hyperparameters used.
Guide: Running Locally
Basic Steps
- Install Required Libraries: Ensure `torch`, `torchaudio`, `datasets`, and `transformers` are installed. These can be installed via pip:

  ```
  pip install torch torchaudio datasets transformers
  ```
- Load Dataset and Model: Use the `datasets` library to load the Common Voice dataset and the `transformers` library to load the pre-trained model and processor:

  ```python
  from datasets import load_dataset
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

  test_dataset = load_dataset("common_voice", "ar-EG", split="test[:2%]")
  processor = Wav2Vec2Processor.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
  model = Wav2Vec2ForCTC.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
  ```
- Preprocess Audio Data: Convert the audio files to the 16 kHz sampling rate the model expects:

  ```python
  import torchaudio

  # Common Voice clips are recorded at 48 kHz; the model expects 16 kHz input.
  resampler = torchaudio.transforms.Resample(48_000, 16_000)

  def speech_file_to_array_fn(batch):
      speech_array, sampling_rate = torchaudio.load(batch["path"])
      batch["speech"] = resampler(speech_array).squeeze().numpy()
      return batch

  test_dataset = test_dataset.map(speech_file_to_array_fn)
  ```
- Run Inference: Process the audio inputs and decode the predictions (a standalone single-file variant is sketched after this list):

  ```python
  import torch

  inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
      logits = model(inputs.input_values).logits

  # Greedy CTC decoding: pick the most likely token per frame, then collapse repeats and blanks.
  predicted_ids = torch.argmax(logits, dim=-1)
  print("Prediction:", processor.batch_decode(predicted_ids))
  print("Reference:", test_dataset["sentence"][:2])
  ```
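Outside the Common Voice test split, the same steps apply to any local recording. A minimal sketch, assuming a hypothetical mono file `recording.wav` and reusing the `processor` and `model` loaded above:

```python
import torch
import torchaudio

# Hypothetical input file; resample from its native rate to the 16 kHz the model expects.
speech_array, native_rate = torchaudio.load("recording.wav")
speech = torchaudio.transforms.Resample(native_rate, 16_000)(speech_array).squeeze().numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```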
Cloud GPU Suggestion
For efficient processing, consider using cloud-based GPU services like AWS EC2, Google Cloud, or Azure to handle large datasets or intensive computations.
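On such an instance, inference can be moved onto the GPU with standard PyTorch device placement. A minimal sketch, reusing the `model`, `processor`, and `test_dataset` from the guide above:

```python
import torch

# Use a CUDA GPU when present; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```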
License
The model and associated code are licensed under the Apache 2.0 License, allowing for both personal and commercial use.