Introduction

The Fon XLSR model is a fine-tuned version of Facebook's Wav2Vec2-Large-XLSR-53 for Automatic Speech Recognition (ASR) in the Fon language. It was fine-tuned on the Fon dataset and expects input speech sampled at 16 kHz.

Architecture

The model uses the Wav2Vec2 architecture and starts from the Wav2Vec2-Large-XLSR-53 checkpoint, which was pretrained on multilingual speech. Fine-tuning adapts that checkpoint to transcribe spoken Fon.

Training

The Fon dataset is split into 8,235 training samples, 1,107 validation samples, and 1,061 test samples. The training script is available here. The model achieves a Word Error Rate (WER) of 14.97% on the test split.
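
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the predicted transcript into the reference, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that standard definition, not the evaluation code used for the reported figure. The function names are hypothetical, and pooling errors over the whole corpus (rather than averaging per utterance) is an assumption.

```python
def word_edit_distance(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref_words)


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        errors, n_words = word_edit_distance(reference, hypothesis)
        total_errors += errors
        total_words += n_words
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# Placeholder transcripts (not Fon data): one substitution and one insertion
# against six reference words.
refs = ["one two three four", "five six"]
hyps = ["one too three four", "five six seven"]
wer = word_error_rate(refs, hyps)
```

A reported WER of 14.97% therefore corresponds to roughly 15 word errors per 100 reference words.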

Guide: Running Locally

  1. Setup Environment: Ensure you have Python installed with the necessary libraries such as torch, torchaudio, datasets, and transformers.

  2. Download Model: Load the fine-tuned model and its processor using:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    processor = Wav2Vec2Processor.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")
    model = Wav2Vec2ForCTC.from_pretrained("chrisjay/wav2vec2-large-xlsr-53-fon")
    
  3. Prepare Data: Load your audio files and, if they were recorded at another rate, resample them to 16 kHz, the sampling rate the model expects.

  4. Run Inference:

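    import torch

    # test_dataset["speech"] is assumed to hold the 16 kHz waveforms prepared in step 3
    inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy decoding: pick the most likely token at each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    predictions = processor.batch_decode(predicted_ids)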
    
  5. GPU Recommendation: For efficient processing, particularly for large datasets or batch processing, using cloud GPUs such as those offered by Google Colab, AWS, or Azure is recommended.

License

The Fon XLSR model is released under the Apache 2.0 License, permitting use and modification with proper attribution.
