Wav2Vec2-Large-XLSR-Kazakh

aismlv

Introduction

The WAV2VEC2-LARGE-XLSR-53-KAZAKH model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Automatic Speech Recognition (ASR) in the Kazakh language. It was trained using the Kazakh Speech Corpus v1.1.

Architecture

The model is based on the Wav2Vec2 architecture, specifically the facebook/wav2vec2-large-xlsr-53 variant, which is designed for multilingual and cross-lingual speech recognition tasks. The model processes audio inputs sampled at 16kHz.

Training

The model was trained using the Kazakh Speech Corpus v1.1. The fine-tuning process involved adapting the pre-trained Wav2Vec2 model to recognize and transcribe Kazakh speech, achieving a Word Error Rate (WER) of 19.65%.

Guide: Running Locally

  1. Set Up Environment: Ensure you have Python, PyTorch, torchaudio, and Hugging Face Transformers installed.

  2. Load Model and Processor:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("aismlv/wav2vec2-large-xlsr-kazakh")
    model = Wav2Vec2ForCTC.from_pretrained("aismlv/wav2vec2-large-xlsr-kazakh")
    
  3. Prepare Data:

    • Use the Kazakh Speech Corpus v1.1.
    • Ensure audio files are resampled to 16kHz.
  4. Run Inference:

    import torch

    # audio_input: a 1-D array of 16 kHz mono audio samples
    inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: pick the most likely token at each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    
  5. Evaluation: Use the wer metric from Hugging Face's datasets library to evaluate the model on a test set.
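For reference, WER is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch of the computation, independent of any evaluation library:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion over four reference words. In practice the library metric handles batching and aggregation for you, but the underlying quantity is the same.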

For faster processing, consider using cloud GPUs such as those available on AWS, GCP, or Azure.

License

The model is licensed under the Apache 2.0 License.
