wav2vec2-large-english-TIMIT-phoneme_v3

Model by speech31 on the Hugging Face Hub.

Introduction

The wav2vec2-large-english-TIMIT-phoneme_v3 model is a fine-tuned version of the facebook/wav2vec2-large model, adapted for phoneme recognition on the TIMIT corpus. It targets Automatic Speech Recognition (ASR) tasks, producing phoneme sequences rather than orthographic text, and builds on the wav2vec2 architecture.

Architecture

The model is based on the wav2vec 2.0 architecture: a convolutional feature encoder followed by a Transformer context network, pretrained on raw audio with self-supervised learning and fine-tuned here with a CTC (Connectionist Temporal Classification) head for phoneme recognition. It is implemented with the Hugging Face Transformers library and runs on PyTorch.
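
As a quick check of the architecture, the model's configuration can be inspected through the standard Transformers API. The expected values in the comments assume the usual wav2vec2-large dimensions:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("speech31/wav2vec2-large-english-TIMIT-phoneme_v3")
    print(config.model_type)         # "wav2vec2"
    print(config.num_hidden_layers)  # Transformer depth (24 for the large variant)
    print(config.hidden_size)        # encoder width (1024 for the large variant)
    print(config.vocab_size)         # phoneme vocabulary size plus CTC blank/special tokens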

Training

Training and Evaluation Data

  • Training Dataset: TIMIT dataset (training + validation set)
  • Evaluation Dataset: TIMIT dataset (test set)
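
For reference, both splits can be loaded with the Hugging Face datasets library via the timit_asr loading script. The TIMIT corpus itself is distributed by the LDC, so depending on your datasets version you may need to point the loader at a local copy; the path below is a placeholder:

    from datasets import load_dataset

    # "/path/to/TIMIT" is a placeholder for a locally available copy of the corpus.
    timit_train = load_dataset("timit_asr", data_dir="/path/to/TIMIT", split="train")
    timit_test = load_dataset("timit_asr", data_dir="/path/to/TIMIT", split="test")
    print(timit_train)
    print(timit_test)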

Training Hyperparameters

  • Learning Rate: 0.0003
  • Train Batch Size: 32
  • Eval Batch Size: 16
  • Seed: 42
  • Gradient Accumulation Steps: 2
  • Total Train Batch Size: 64
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning Rate Scheduler: Linear
  • Warmup Steps: 1000
  • Number of Epochs: 50
  • Mixed Precision Training: Native AMP
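
As a rough sketch (not the author's original training script), the hyperparameters above map onto a Hugging Face TrainingArguments configuration as follows; the output_dir value is a placeholder, and the Adam betas/epsilon listed above are the Trainer defaults:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="wav2vec2-large-english-TIMIT-phoneme_v3",  # placeholder
        learning_rate=3e-4,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,   # 32 x 2 = total train batch size of 64
        seed=42,
        lr_scheduler_type="linear",
        warmup_steps=1000,
        num_train_epochs=50,
        fp16=True,                       # native AMP mixed precision
    )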

Training Results

  • Validation loss during training (evaluated every 500 steps):

    • Step 500 (epoch 6.94): training loss 2.2678, validation loss 0.2347
    • Step 1000 (epoch 13.88): validation loss 0.3358
    • Step 1500 (epoch 20.83): validation loss 0.3865
    • Step 2000 (epoch 27.77): validation loss 0.4162
    • Step 2500 (epoch 34.72): validation loss 0.4429
    • Step 3000: validation loss 0.3697
  • Final Evaluation Metrics:

    • Loss: 0.3697
    • CER (Character Error Rate): 0.0987
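
For reference, a character error rate like the one above can be computed from decoded predictions and reference transcriptions with the evaluate library. This is a usage sketch; the phoneme strings below are made-up placeholders, not TIMIT data:

    # requires: pip install evaluate jiwer
    import evaluate

    cer_metric = evaluate.load("cer")
    predictions = ["hh ah l ow w er l d"]  # placeholder decoded phoneme string
    references = ["hh eh l ow w er l d"]   # placeholder reference phoneme string
    print(cer_metric.compute(predictions=predictions, references=references))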

Framework Versions

  • Transformers: 4.23.0.dev0
  • PyTorch: 1.12.1.post201
  • Datasets: 2.5.2.dev0
  • Tokenizers: 0.12.1

Guide: Running Locally

  1. Setup Environment:

    • Ensure you have Python and PyTorch installed.
    • Install the Transformers and Datasets libraries.
  2. Install Model Requirements:

    # soundfile is used in the inference example below to read audio files
    pip install transformers datasets soundfile
    
  3. Load the Model:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"

    # The processor bundles the feature extractor (raw audio -> model inputs)
    # and the CTC tokenizer (predicted ids -> phoneme strings).
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    
  4. Inference:

    • Use the model and processor to transcribe audio files into phoneme sequences; a minimal inference sketch is shown after this guide.
  5. Cloud GPUs:

    • For efficient training and inference, consider using cloud GPU providers like AWS, GCP, or Azure.
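
As referenced in step 4, here is a minimal inference sketch. It assumes soundfile is installed and that example.wav is a placeholder for a 16 kHz mono recording; greedy CTC decoding is used, without any language-model rescoring:

    import torch
    import soundfile as sf
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    model.eval()

    # "example.wav" is a placeholder; the model expects 16 kHz mono audio.
    speech, sample_rate = sf.read("example.wav")

    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Greedy CTC decoding: take the most likely token per frame, then let the
    # tokenizer collapse repeated tokens and blanks into a phoneme string.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids))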

License

This model is licensed under the Apache 2.0 License.
