wav2vec2-large-xls-r-300m-ha-cv8
Introduction
The WAV2VEC2-LARGE-XLS-R-300M-HA-CV8 model, published by anuragshas, is a fine-tuned version of Facebook's wav2vec2-xls-r-300m, trained for Automatic Speech Recognition (ASR) in the Hausa language on the Common Voice 8.0 dataset. It is built with the PyTorch framework and is compatible with the Transformers library.
Architecture
This model is based on the wav2vec 2.0 (XLS-R) architecture: a large Transformer encoder with roughly 300 million parameters, pretrained on multilingual speech and fine-tuned here with a CTC head to transcribe Hausa audio.
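As a quick sanity check, the checkpoint can be loaded with the Transformers library and its parameter count inspected. A minimal sketch, assuming the Hugging Face Hub ID anuragshas/wav2vec2-large-xls-r-300m-ha-cv8:

```python
from transformers import AutoModelForCTC

# Minimal sketch: load the fine-tuned checkpoint (assumed Hub ID) and
# confirm the roughly 300M-parameter size mentioned above.
model = AutoModelForCTC.from_pretrained("anuragshas/wav2vec2-large-xls-r-300m-ha-cv8")
print(f"Parameters: {model.num_parameters() / 1e6:.0f}M")  # expect roughly 300M
```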
Training
Training Procedure
The model was fine-tuned on the Common Voice 8.0 dataset. Key hyperparameters include (a sketch of how these settings map onto the Trainer API follows the list):
- Learning Rate: 0.0001
- Train Batch Size: 16
- Eval Batch Size: 8
- Total Train Batch Size: 32
- Optimizer: Adam with betas (0.9, 0.999)
- Learning Rate Scheduler: Cosine with Restarts
- Scheduler Warmup Steps: 1000
- Number of Epochs: 100
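As a rough illustration only, these hyperparameters might be expressed with the Transformers TrainingArguments API as follows; the output directory is a placeholder, and any option not listed above is left at its library default:

```python
from transformers import TrainingArguments

# Hedged sketch: maps the hyperparameters listed above onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xls-r-300m-ha-cv8",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,        # 16 x 2 = total train batch size of 32 (assuming one GPU)
    num_train_epochs=100,
    warmup_steps=1000,
    lr_scheduler_type="cosine_with_restarts",
    adam_beta1=0.9,
    adam_beta2=0.999,
)
```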
Training Results
- Loss: 0.6094
- WER: 0.5234
Evaluation Metrics
- Test WER: 36.295
- Test CER: 11.073
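For context, WER and CER figures of this kind can be computed with the datasets library's built-in metrics. A minimal sketch with placeholder predictions and references (the x100 scaling assumes the test figures above are percentages):

```python
from datasets import load_metric

# Hedged sketch: compute WER and CER from lists of model transcriptions
# and reference transcripts (placeholders shown here).
wer_metric = load_metric("wer")
cer_metric = load_metric("cer")

predictions = ["placeholder model transcription"]
references = ["placeholder reference transcript"]

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```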
Framework Versions
- Transformers 4.16.1
- PyTorch 1.10.0+cu111
- Datasets 1.18.2
- Tokenizers 0.11.0
Guide: Running Locally
Basic Steps
- Install Required Libraries: Ensure you have the necessary Python libraries installed, including torch, transformers, datasets, and torchaudio.
- Load Dataset: Use the datasets library to load the Common Voice 8.0 dataset for the Hausa language.
- Model Initialization: Initialize the model and processor using AutoModelForCTC and AutoProcessor from the transformers library.
- Audio Preprocessing: Resample audio to 16 kHz as required by the model.
- Inference: Run inference using the model and processor to convert audio to text (a runnable sketch follows this list).
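Putting the steps together, a minimal end-to-end sketch might look as follows. It assumes the Hub ID anuragshas/wav2vec2-large-xls-r-300m-ha-cv8 and a local audio file sample.wav instead of the full Common Voice download; replace both with your own.

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_id = "anuragshas/wav2vec2-large-xls-r-300m-ha-cv8"  # assumed Hub ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# Load an audio file and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Convert the waveform to model inputs and run greedy CTC decoding
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```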
Cloud GPUs
For optimal performance and faster computation, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
License
The model is licensed under the Apache 2.0 License, allowing for wide usage and modification in both commercial and non-commercial applications.