wav2vec2-large-english-TIMIT-phoneme_v3

Model by speech31 on the Hugging Face Hub.

Introduction

The wav2vec2-large-english-TIMIT-phoneme_v3 model is a fine-tuned version of the facebook/wav2vec2-large model, adapted for phoneme recognition on the TIMIT corpus. It targets Automatic Speech Recognition (ASR) tasks, producing phoneme sequences rather than orthographic text, and builds on the wav2vec2 architecture.

Architecture

The model is based on the wav2vec 2.0 architecture: a convolutional feature encoder followed by a Transformer context network, pretrained on raw audio with self-supervised learning and fine-tuned here with a CTC (Connectionist Temporal Classification) head for phoneme recognition. It is implemented with the Hugging Face Transformers library and runs on PyTorch.
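
As a quick check of the architecture, the model's configuration can be inspected through the standard Transformers API. The expected values in the comments assume the usual wav2vec2-large dimensions:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("speech31/wav2vec2-large-english-TIMIT-phoneme_v3")
    print(config.model_type)         # "wav2vec2"
    print(config.num_hidden_layers)  # Transformer depth (24 for the large variant)
    print(config.hidden_size)        # encoder width (1024 for the large variant)
    print(config.vocab_size)         # phoneme vocabulary size plus CTC blank/special tokens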

Training

Training and Evaluation Data

  • Training Dataset: TIMIT dataset (training + validation set)
  • Evaluation Dataset: TIMIT dataset (test set)
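
For reference, both splits can be loaded with the Hugging Face datasets library via the timit_asr loading script. The TIMIT corpus itself is distributed by the LDC, so depending on your datasets version you may need to point the loader at a local copy; the path below is a placeholder:

    from datasets import load_dataset

    # "/path/to/TIMIT" is a placeholder for a locally available copy of the corpus.
    timit_train = load_dataset("timit_asr", data_dir="/path/to/TIMIT", split="train")
    timit_test = load_dataset("timit_asr", data_dir="/path/to/TIMIT", split="test")
    print(timit_train)
    print(timit_test)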

Training Hyperparameters

  • Learning Rate: 0.0003
  • Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Gradient Accumulation Steps: 2
  • Total Train Batch Size: 64
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning Rate Scheduler: Linear
  • Warmup Steps: 1000
  • Number of Epochs: 50
  • Mixed Precision Training: Native AMP
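
As a rough sketch (not the author's original training script), the hyperparameters above map onto a Hugging Face TrainingArguments configuration as follows; the output_dir value is a placeholder, and the Adam betas/epsilon listed above are the Trainer defaults:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="wav2vec2-large-english-TIMIT-phoneme_v3",  # placeholder
        learning_rate=3e-4,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,   # 32 x 2 = total train batch size of 64
        seed=42,
        lr_scheduler_type="linear",
        warmup_steps=1000,
        num_train_epochs=50,
        fp16=True,                       # native AMP mixed precision
    )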

Training Results

  • Validation loss during training (evaluated every 500 steps):

    • Step 500 (epoch 6.94): training loss 2.2678, validation loss 0.2347
    • Step 1000 (epoch 13.88): validation loss 0.3358
    • Step 1500 (epoch 20.83): validation loss 0.3865
    • Step 2000 (epoch 27.77): validation loss 0.4162
    • Step 2500 (epoch 34.72): validation loss 0.4429
    • Step 3000: validation loss 0.3697
  • Final Evaluation Metrics:

    • Loss: 0.3697
    • CER (Character Error Rate): 0.0987
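
For reference, a character error rate like the one above can be computed from decoded predictions and reference transcriptions with the evaluate library. This is a usage sketch; the phoneme strings below are made-up placeholders, not TIMIT data:

    # requires: pip install evaluate jiwer
    import evaluate

    cer_metric = evaluate.load("cer")
    predictions = ["hh ah l ow w er l d"]  # placeholder decoded phoneme string
    references = ["hh eh l ow w er l d"]   # placeholder reference phoneme string
    print(cer_metric.compute(predictions=predictions, references=references))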

Framework Versions

  • Transformers: 4.23.0.dev0
  • PyTorch: 1.12.1.post201
  • Datasets: 2.5.2.dev0
  • Tokenizers: 0.12.1

Guide: Running Locally

  1. Setup Environment:

    • Ensure you have Python and PyTorch installed.
    • Install the Transformers and Datasets libraries.
  2. Install Model Requirements:

    # soundfile is used in the inference example below to read audio files
    pip install transformers datasets soundfile
    
  3. Load the Model:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"

    # The processor bundles the feature extractor (raw audio -> model inputs)
    # and the CTC tokenizer (predicted ids -> phoneme strings).
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    
  4. Inference:

    • Use the model and processor to transcribe audio files into phoneme sequences; a minimal inference sketch is shown after this guide.
  5. Cloud GPUs:

    • For efficient training and inference, consider using cloud GPU providers like AWS, GCP, or Azure.
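
As referenced in step 4, here is a minimal inference sketch. It assumes soundfile is installed and that example.wav is a placeholder for a 16 kHz mono recording; greedy CTC decoding is used, without any language-model rescoring:

    import torch
    import soundfile as sf
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    model.eval()

    # "example.wav" is a placeholder; the model expects 16 kHz mono audio.
    speech, sample_rate = sf.read("example.wav")

    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: take the most likely token per frame, then let the
    # tokenizer collapse repeated tokens and blanks into a phoneme string.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids))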

License

This model is licensed under the Apache 2.0 License.
