wav2vec2-urdu by kingabzpro
Introduction
WAV2VEC2-URDU is an Automatic Speech Recognition (ASR) model that transcribes Urdu-language audio into text. It was fine-tuned from the base model Harveenchadha/vakyansh-wav2vec2-urdu-urm-60 on the Common Voice dataset, is built with the Transformers library, and is compatible with PyTorch.
Architecture
The model is a variant of the Wav2Vec2 architecture, specifically fine-tuned for the Urdu language. It was developed as part of the Robust Speech Event and is evaluated using Word Error Rate (WER) and Character Error Rate (CER).
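Wav2Vec2ForCTC models emit one logit vector per audio frame, and the final transcript comes from CTC decoding of those frames. A minimal greedy-decoding sketch, with a made-up toy vocabulary and frame IDs for illustration (a real model's vocabulary and frame rate differ):

```python
# Greedy CTC decoding sketch: collapse consecutive repeats, drop blanks.
# The vocabulary and per-frame argmax IDs below are hypothetical.

BLANK = 0  # assumption: the CTC blank token sits at index 0
VOCAB = {0: "", 1: "س", 2: "ل", 3: "ا", 4: "م"}  # toy Urdu vocabulary

def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeated IDs, then remove blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(VOCAB[i])
        prev = i
    return "".join(out)

# Per-frame argmax IDs for the word "سلام", with repeats and blanks:
frames = [1, 1, 0, 2, 2, 3, 0, 0, 4, 4]
print(ctc_greedy_decode(frames))  # → سلام
```

This collapse-then-drop rule is why CTC models need no frame-level alignment labels during training.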
Training
The model was trained on 0.58 hours of Common Voice Urdu audio, starting from the vakyansh-wav2vec2-urdu-urm-60 checkpoint because of the limited number of available samples. Key hyperparameters used during training include:
- Learning Rate: 0.0003
- Train Batch Size: 64
- Eval Batch Size: 8
- Optimizer: Adam with betas (0.9, 0.999) and epsilon 1e-08
- Gradient Accumulation Steps: 2
- Total Train Batch Size: 128
- Number of Epochs: 100
- Mixed Precision Training: Native AMP
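The listed values are internally consistent: the total train batch size is the per-device batch size times the gradient accumulation steps (assuming a single GPU, which the model card does not state explicitly):

```python
# Sanity check on the hyperparameters above.
per_device_batch = 64   # Train Batch Size
grad_accum_steps = 2    # Gradient Accumulation Steps
num_devices = 1         # assumption: a single GPU

total_batch = per_device_batch * grad_accum_steps * num_devices
print(total_batch)  # → 128, matching "Total Train Batch Size: 128"
```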
Training results show a WER of 0.5747 and a CER of 0.3268.
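Both metrics are normalized edit distances: WER over words, CER over characters. A dependency-free sketch with hypothetical example strings (model cards typically compute these with packages such as `jiwer` or the `datasets` metrics):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

ref = "speech recognition in urdu"   # hypothetical reference transcript
hyp = "speech recognition on urdu"   # hypothetical model output
print(word_error_rate(ref, hyp))  # → 0.25 (1 substituted word out of 4)
```

By this definition a WER of 0.5747 means roughly 57% of reference words required an edit, which is plausible given only 0.58 hours of training audio.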
Guide: Running Locally
To run the model locally, follow these steps:
- Set Up Environment
  Ensure you have Python installed, and set up a virtual environment.

- Install Dependencies
  Install the necessary libraries:

  ```bash
  pip install transformers==4.16.0.dev0
  pip install torch==1.10.1+cu102
  pip install datasets==1.17.1.dev0
  pip install tokenizers==0.11.0
  ```

- Download Model
  Use the Hugging Face Transformers library to download and load the model (Wav2Vec2Processor replaces the deprecated Wav2Vec2Tokenizer):

  ```python
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

  model = Wav2Vec2ForCTC.from_pretrained("kingabzpro/wav2vec2-urdu")
  processor = Wav2Vec2Processor.from_pretrained("kingabzpro/wav2vec2-urdu")
  ```

- Inference
  Run inference on your audio files to transcribe the speech into text.
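The inference step can be sketched as follows, assuming a mono WAV file at a hypothetical path `audio.wav` and using torchaudio for loading; Wav2Vec2 models expect 16 kHz input:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("kingabzpro/wav2vec2-urdu")
processor = Wav2Vec2Processor.from_pretrained("kingabzpro/wav2vec2-urdu")

# Load the audio and resample to the 16 kHz rate Wav2Vec2 expects.
waveform, sample_rate = torchaudio.load("audio.wav")  # hypothetical path
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: argmax per frame, collapsed by the processor.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```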
For enhanced performance, it is recommended to use cloud GPUs such as AWS EC2 instances with NVIDIA GPUs or Google Cloud Platform's AI Platform.
License
The model is licensed under the Apache-2.0 license, allowing for both personal and commercial use with attribution.