wav2vec2-large-xls-r-300m-ha-cv8

anuragshas

Introduction

The wav2vec2-large-xls-r-300m-ha-cv8 model is a fine-tuned version of Facebook's wav2vec2-xls-r-300m for Automatic Speech Recognition (ASR) in the Hausa language, trained on the Common Voice 8.0 dataset. It is implemented in PyTorch and can be loaded with the Transformers library.

Architecture

This model is based on the wav2vec2 architecture. It uses the large XLS-R transformer configuration with roughly 300 million parameters, which takes raw audio as input and is fine-tuned here for ASR.

Training

Training Procedure

The model was trained using the Common Voice 8.0 dataset. Key hyperparameters include:

  • Learning Rate: 0.0001
  • Train Batch Size: 16
  • Eval Batch Size: 8
  • Total Train Batch Size: 32
  • Optimizer: Adam with betas (0.9, 0.999)
  • Learning Rate Scheduler: Cosine with Restarts
  • Scheduler Warmup Steps: 1000
  • Number of Epochs: 100
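
The hyperparameters above can be sketched as keyword arguments for the Transformers TrainingArguments class. This is a minimal sketch, not the author's training script. It assumes that the listed train batch size is per device, that training ran on a single device, and that the total batch size of 32 therefore came from 2 gradient accumulation steps. The helper names and the output directory are hypothetical.

```python
# Values copied from the hyperparameter list above. The keys are the
# corresponding transformers.TrainingArguments parameter names.
HYPERPARAMETERS = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 16,  # assumption: listed size is per device
    "per_device_eval_batch_size": 8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_steps": 1000,
    "num_train_epochs": 100,
}
TOTAL_TRAIN_BATCH_SIZE = 32


def gradient_accumulation_steps(total_batch, per_device_batch, num_devices=1):
    """Accumulation steps needed to reach the total (effective) batch size."""
    steps, remainder = divmod(total_batch, per_device_batch * num_devices)
    if remainder:
        raise ValueError("total batch size is not a multiple of the per-step batch")
    return steps


def build_training_arguments(output_dir="./wav2vec2-ha-cv8"):
    """Build TrainingArguments; output_dir is a hypothetical placeholder."""
    # Imported lazily so the arithmetic above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        # Assumption: single device, so 32 / 16 = 2 accumulation steps.
        gradient_accumulation_steps=gradient_accumulation_steps(
            TOTAL_TRAIN_BATCH_SIZE,
            HYPERPARAMETERS["per_device_train_batch_size"],
        ),
        **HYPERPARAMETERS,
    )
```

With more than one device, the same total batch size would be reached with fewer accumulation steps, so the single-device assumption should be checked before reusing these values.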

Training Results

  • Loss: 0.6094
  • WER: 0.5234 (as a fraction, i.e. 52.34%)

Evaluation Metrics

  • Test WER: 36.295%
  • Test CER: 11.073%
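
Word error rate (WER) and character error rate (CER) are both edit distances divided by the reference length, counted over words and characters respectively. The following is a minimal sketch of that standard definition, not the evaluation script used for this model. The function names and the toy strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits / total reference length."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        # Words for WER, characters (spaces included) for CER.
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length


# Toy example: one substitution and one deletion against 4 reference words.
wer = error_rate(["a b c d"], ["a x c"], unit="word")  # 2 / 4 = 0.5
cer = error_rate(["a b c d"], ["a x c"], unit="char")  # 3 / 7
```

Multiplying the returned fraction by 100 gives the percentage form.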

Framework Versions

  • Transformers 4.16.1
  • PyTorch 1.10.0+cu111
  • Datasets 1.18.2
  • Tokenizers 0.11.0

Guide: Running Locally

Basic Steps

  1. Install Required Libraries: Ensure you have the necessary Python libraries installed, including torch, transformers, datasets, and torchaudio.
  2. Load Dataset: Use the datasets library to load the Common Voice 8.0 dataset for the Hausa language.
  3. Model Initialization: Initialize the model and processor using AutoModelForCTC and AutoProcessor from the transformers library.
  4. Audio Preprocessing: Resample audio to 16 kHz, as required by the model.
  5. Inference: Run inference using the model and processor to convert audio to text.
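
The steps above can be sketched as follows, after installing torch, transformers, datasets, and torchaudio with pip. This is a sketch under stated assumptions rather than a verified recipe: the model ID is assumed from the author and model names on this card, the dataset ID and the "ha" language code are assumed to be the Hub's Common Voice 8.0 identifiers (access may require accepting the dataset terms and logging in), and the function names are hypothetical. If the repository ships a language model, AutoProcessor may return a processor that decodes from logits, which the last function tries to account for.

```python
# Assumed identifiers, not confirmed by this card.
MODEL_ID = "anuragshas/wav2vec2-large-xls-r-300m-ha-cv8"
DATASET_ID = "mozilla-foundation/common_voice_8_0"
LANGUAGE = "ha"
TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def load_hausa_split(split="test"):
    """Step 2: load Common Voice 8.0 Hausa, decoded at 16 kHz."""
    from datasets import Audio, load_dataset

    # use_auth_token matches Datasets 1.18.2; newer releases call it `token`.
    dataset = load_dataset(DATASET_ID, LANGUAGE, split=split, use_auth_token=True)
    # Casting the audio column makes datasets resample on access (step 4).
    return dataset.cast_column("audio", Audio(sampling_rate=TARGET_SAMPLE_RATE))


def load_audio_file(path):
    """Step 4 for a local file: downmix to mono and resample to 16 kHz."""
    import torchaudio

    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, samples)
    waveform = waveform.mean(dim=0)
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(
            waveform, sample_rate, TARGET_SAMPLE_RATE
        )
    return waveform.numpy()


def transcribe(speech, model_id=MODEL_ID):
    """Steps 3 and 5: load model and processor, then decode a 1-D 16 kHz array."""
    import torch
    from transformers import AutoModelForCTC, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCTC.from_pretrained(model_id).eval()

    inputs = processor(
        speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits

    # Assumption: a processor with a `decoder` attribute is LM-boosted and
    # decodes directly from logits instead of from token IDs.
    if hasattr(processor, "decoder"):
        return processor.batch_decode(logits.numpy()).text[0]
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

For a dataset sample, transcribe(load_hausa_split()[0]["audio"]["array"]) would return the predicted text. For a local recording, pass the result of load_audio_file instead. Loading the model once and reusing it would be preferable when transcribing many files.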

Cloud GPUs

For optimal performance and faster computation, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.

License

The model is licensed under the Apache 2.0 License, allowing for wide usage and modification in both commercial and non-commercial applications.
