wav2vec2-large-xlsr-53-th

airesearch

Introduction

The wav2vec2-large-xlsr-53-th model is a fine-tuned version of wav2vec2-large-xlsr-53 for Thai Automatic Speech Recognition (ASR). It was fine-tuned on the Thai portion of the Common Voice 7.0 dataset to improve recognition of Thai speech.

Architecture

The model is built on the wav2vec2-large-xlsr-53 architecture, a wav2vec 2.0 model pretrained on 53 languages for cross-lingual speech representation (XLSR). Because Thai is written without spaces between words, the fine-tuning and evaluation pipeline relies on Thai word tokenizers such as PyThaiNLP's newmm and deepcut to segment transcripts.
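
Both tokenizers are available through PyThaiNLP's common interface. A minimal sketch of segmenting Thai text with each (assuming the pythainlp and deepcut packages are installed; the sample sentence is illustrative):

    from pythainlp.tokenize import word_tokenize

    text = "สวัสดีครับ"  # Thai for "hello"
    # newmm is PyThaiNLP's default dictionary-based tokenizer;
    # deepcut is a neural tokenizer exposed through the same interface.
    print(word_tokenize(text, engine="newmm"))
    print(word_tokenize(text, engine="deepcut"))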

Training

The model was fine-tuned on the Thai portion of the Common Voice 7.0 dataset, which contains 133 validated hours of audio. Training ran on a single V100 GPU with a batch size of 32 and a learning rate of 1e-4 for 100 epochs. The configuration uses dropout and gradient checkpointing to improve generalization and memory efficiency.
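
The original training script is not reproduced here, but a minimal sketch of how the stated hyperparameters could be expressed with transformers.TrainingArguments follows; the output directory and the fp16 flag are assumptions, the other values come from the figures above:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./wav2vec2-large-xlsr-53-th",  # assumed output path
        per_device_train_batch_size=32,            # stated batch size
        learning_rate=1e-4,                        # stated learning rate
        num_train_epochs=100,                      # stated epoch count
        gradient_checkpointing=True,               # stated: trades compute for memory
        fp16=True,                                 # assumption: mixed precision on a V100
    )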

Guide: Running Locally

  1. Load the Pretrained Model and Processor:

    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
    model = Wav2Vec2ForCTC.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
    
  2. Preprocess Audio Data: Use the torchaudio library to load audio files and resample them to the 16 kHz sampling rate the model expects, for example (the file path below is illustrative):
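
    import torchaudio

    # Load a clip and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = torchaudio.load("example.wav")
    if sampling_rate != 16_000:
        resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
        speech_array = resampler(speech_array)
    speech = speech_array.squeeze().numpy()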

  3. Inference: Prepare input tensors and run inference (a higher-level alternative using the pipeline API is sketched after this list):

    # test_dataset["speech"] is assumed to hold 16 kHz waveform arrays
    # produced by the preprocessing in step 2.
    inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    print("Prediction:", processor.batch_decode(predicted_ids))
    
  4. Hardware Recommendations: For optimal performance, especially for training, consider using cloud GPUs such as NVIDIA V100 or A100 available on platforms like AWS, GCP, or Azure.
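
As a higher-level alternative to steps 1-3, the same checkpoint can be run through the transformers pipeline API, which handles model loading, resampling, and CTC decoding internally (the audio path is illustrative; decoding audio files this way requires ffmpeg):

    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="airesearch/wav2vec2-large-xlsr-53-th",
    )
    print(asr("example.wav"))  # returns a dict like {"text": "..."}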

License

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0).
