WAV2VEC2-XLS-R-300M-PHONEME

Author: vitouphy

Introduction

WAV2VEC2-XLS-R-300M-PHONEME is a fine-tuned version of the facebook/wav2vec2-xls-r-300m model for automatic speech recognition at the phoneme level. On the evaluation set it achieves a loss of 0.3327 and a character error rate (CER) of 0.1332.
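
CER is the character-level edit distance between a prediction and its reference, divided by the number of characters in the reference. A minimal sketch of computing it with the jiwer package (an assumption; jiwer is not listed in the framework versions below, and the strings are made up):

    import jiwer  # edit-distance metrics; install with: pip install jiwer

    reference = "b æ t"   # hypothetical reference phoneme string
    prediction = "b æ d"  # hypothetical model output

    # CER = (substitutions + insertions + deletions) / characters in the reference
    print(jiwer.cer(reference, prediction))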

Architecture

The model uses the wav2vec2 architecture, implemented in the Transformers library on top of PyTorch. It is available in the Safetensors format and is compatible with Inference Endpoints.
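
The architecture details can be verified from the hosted configuration. A minimal sketch (the printed values come from the configuration on the Hugging Face Hub, not from this card):

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    print(config.model_type)         # "wav2vec2"
    print(config.num_hidden_layers)  # encoder depth of the underlying XLS-R 300M model
    print(config.hidden_size)        # transformer hidden dimension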

Training

Training Procedure

The model was fine-tuned with the following hyperparameters (a TrainingArguments sketch follows the list):

  • Learning Rate: 3e-05
  • Train Batch Size: 8
  • Evaluation Batch Size: 8
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: Adam with betas (0.9, 0.999) and epsilon 1e-08
  • Learning Rate Scheduler Type: Linear
  • Warmup Steps: 2000
  • Training Steps: 7000
  • Mixed Precision Training: Native AMP
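
A minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments. This is an assumption about the setup; the card does not include the actual training script, and the output path is hypothetical:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./wav2vec2-xls-r-300m-phoneme",  # hypothetical output path
        learning_rate=3e-05,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=4,   # effective train batch size of 32 (8 x 4)
        lr_scheduler_type="linear",
        warmup_steps=2000,
        max_steps=7000,
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-08,
        fp16=True,                       # native AMP mixed-precision training
    )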

Training Results

Training progress is documented as follows:

Epoch   Step   Training Loss   Validation Loss   CER
1.32    1000   3.4324          3.3693            0.9091
2.65    2000   2.1751          1.1382            0.2397
3.97    3000   1.3986          0.4886            0.1452
5.30    4000   1.2285          0.3842            0.1351
6.62    5000   1.1420          0.3505            0.1349
7.95    6000   1.1075          0.3323            0.1317
9.27    7000   1.0867          0.3265            0.1315
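
The validation CER above is the kind of metric typically produced by a compute_metrics callback passed to the Trainer. A minimal sketch of such a callback (an assumption; the author's evaluation code is not shown, and the "cer" metric requires the jiwer package):

    import numpy as np
    from datasets import load_metric
    from transformers import Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    cer_metric = load_metric("cer")

    def compute_metrics(pred):
        # Greedy CTC decoding: take the highest-scoring token at every frame.
        pred_ids = np.argmax(pred.predictions, axis=-1)
        # Labels use -100 for padding; swap it back to the tokenizer's pad token id.
        pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
        pred_str = processor.batch_decode(pred_ids)
        label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
        return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}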

Framework Versions

  • Transformers: 4.17.0.dev0
  • PyTorch: 1.10.2+cu102
  • Datasets: 1.18.2.dev0
  • Tokenizers: 0.11.0

Guide: Running Locally

To run this model locally, follow these steps:

  1. Install Dependencies:
    Ensure Python and PyTorch are installed. Use pip to install Transformers and other required libraries.

    pip install torch transformers datasets
    
  2. Load the Model:
    Use the Transformers library to load the model.

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    
  3. Inference:
    Prepare the audio, run it through the processor and model, and decode the output; a sketch follows this guide.
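
The following is a minimal inference sketch, not the author's own example. It assumes a local audio file named audio.wav (hypothetical) and uses librosa, which is not in the dependency list above, for loading and resampling:

    import torch
    import librosa
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")

    # Load the waveform and resample to the 16 kHz rate expected by wav2vec2 models.
    speech, _ = librosa.load("audio.wav", sr=16000)

    # Convert the raw waveform into model inputs.
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

    # Forward pass and greedy CTC decoding (argmax at each frame).
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # Map token ids back to a phoneme string.
    print(processor.batch_decode(predicted_ids)[0])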
    

Suggestion: Cloud GPUs

For faster inference and fine-tuning, consider cloud GPU offerings from providers such as AWS, Google Cloud Platform, or Azure.

License

The model is licensed under the Apache 2.0 License.
