wav2vec2 urdu

kingabzpro

Introduction

WAV2VEC2-URDU is an Automatic Speech Recognition (ASR) model designed to transcribe Urdu language audio into text. The model is fine-tuned from the base model Harveenchadha/vakyansh-wav2vec2-urdu-urm-60 using the Common Voice dataset. The model is built using the Transformers library and is compatible with PyTorch.

Architecture

The model is a variant of the Wav2Vec2 architecture, specifically fine-tuned for the Urdu language. It leverages robust speech event capabilities and is evaluated using metrics such as Word Error Rate (WER) and Character Error Rate (CER).

Training

The model was trained on 0.58 hours of dataset data, utilizing the vakyansh-wav2vec2-urdu-urm-60 checkpoint due to the limited number of available samples. Key hyperparameters used during training include:

  • Learning Rate: 0.0003
  • Train Batch Size: 64
  • Eval Batch Size: 8
  • Optimizer: Adam with betas (0.9, 0.999) and epsilon 1e-08
  • Gradient Accumulation Steps: 2
  • Total Train Batch Size: 128
  • Number of Epochs: 100
  • Mixed Precision Training: Native AMP

Training results show a WER of 0.5747 and a CER of 0.3268.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Set Up Environment
    Ensure you have Python installed, and set up a virtual environment.

  2. Install Dependencies
    Install the necessary libraries:

    pip install transformers==4.16.0.dev0
    pip install torch==1.10.1+cu102
    pip install datasets==1.17.1.dev0
    pip install tokenizers==0.11.0
    
  3. Download Model
    Use the Hugging Face Transformers library to download and load the model:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
    model = Wav2Vec2ForCTC.from_pretrained("kingabzpro/wav2vec2-urdu")
    tokenizer = Wav2Vec2Tokenizer.from_pretrained("kingabzpro/wav2vec2-urdu")
    
  4. Inference
    Run inference on your audio files to transcribe the speech into text.

For enhanced performance, it is recommended to use cloud GPUs such as AWS EC2 instances with NVIDIA GPUs or Google Cloud Platform's AI Platform.

License

The model is licensed under the Apache-2.0 license, allowing for both personal and commercial use with attribution.

More Related APIs in Automatic Speech Recognition