# wav2vec2-large-english-TIMIT-phoneme_v3
## Introduction
The `wav2vec2-large-english-TIMIT-phoneme_v3` model is a fine-tuned version of the `facebook/wav2vec2-large` model, optimized for phoneme recognition on the TIMIT dataset. It is designed for Automatic Speech Recognition (ASR) tasks and leverages the wav2vec2 architecture.
## Architecture
The model is based on the wav2vec2 architecture, a Transformer-based model for speech recognition. It is implemented using the Transformers library and is compatible with PyTorch.
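As a quick sanity check, the checkpoint's configuration can be inspected without downloading the full weights. A minimal sketch (the printed values are whatever the hosted config contains):

```python
from transformers import Wav2Vec2Config

# Fetch only the model configuration from the Hugging Face Hub.
config = Wav2Vec2Config.from_pretrained(
    "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"
)

# The "large" wav2vec2 variant uses a 24-layer Transformer encoder
# with a hidden size of 1024.
print(config.model_type)         # "wav2vec2"
print(config.num_hidden_layers)  # Transformer encoder depth
print(config.hidden_size)        # encoder hidden dimension
```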
## Training
### Training and Evaluation Data
- Training Dataset: TIMIT dataset (training + validation set)
- Evaluation Dataset: TIMIT dataset (test set)
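TIMIT itself is distributed by the LDC rather than bundled with the Datasets library, so a local copy is needed. A minimal loading sketch, assuming the `timit_asr` loading script and a placeholder path to your TIMIT download:

```python
from datasets import load_dataset

# Recent versions of the "timit_asr" script expect a path to a local
# copy of the corpus; "/path/to/TIMIT" is a placeholder.
timit = load_dataset("timit_asr", data_dir="/path/to/TIMIT")

print(timit["train"])  # used for fine-tuning
print(timit["test"])   # held out for evaluation
```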
### Training Hyperparameters
- Learning Rate: 0.0003
- Train Batch Size: 32
- Eval Batch Size: 16
- Seed: 42
- Gradient Accumulation Steps: 2
- Total Train Batch Size: 64
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- Learning Rate Scheduler: Linear
- Warmup Steps: 1000
- Number of Epochs: 50
- Mixed Precision Training: Native AMP
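For orientation, these settings map onto `transformers.TrainingArguments` roughly as in the sketch below. The original training script is not part of this card, so the output directory and the exact argument mapping are assumptions (the Adam betas and epsilon listed above are the library defaults):

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the reported settings; output_dir is
# a placeholder and the mapping is an assumption, not the actual script.
training_args = TrainingArguments(
    output_dir="wav2vec2-large-english-TIMIT-phoneme_v3",
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,  # 32 x 2 = effective batch size 64
    num_train_epochs=50,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    seed=42,
    fp16=True,  # native AMP mixed precision
)
```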
### Training Results
**Training Log**:

| Training Loss | Epoch | Step | Validation Loss |
|--------------:|------:|-----:|----------------:|
| 2.2678        |  6.94 |  500 | 0.2347          |
|               | 13.88 | 1000 | 0.3358          |
|               | 20.83 | 1500 | 0.3865          |
|               | 27.77 | 2000 | 0.4162          |
|               | 34.72 | 2500 | 0.4429          |
|               |       | 3000 | 0.3697          |
**Final Evaluation Metrics** (TIMIT test set):
- Loss: 0.3697
- CER (Character Error Rate): 0.0987
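CER is the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of computing such a score with the `evaluate` library (an assumption; the card does not state which tool was used), on made-up placeholder strings:

```python
import evaluate  # pip install evaluate jiwer

cer_metric = evaluate.load("cer")

# Placeholder phoneme strings, not actual model predictions.
predictions = ["hh ah l ow w er l d"]
references = ["hh eh l ow w er l d"]

# CER = (substitutions + insertions + deletions) / reference characters
print(cer_metric.compute(predictions=predictions, references=references))
```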
### Framework Versions
- Transformers: 4.23.0.dev0
- PyTorch: 1.12.1.post201
- Datasets: 2.5.2.dev0
- Tokenizers: 0.12.1
## Guide: Running Locally
- **Setup Environment**:
  - Ensure you have Python and PyTorch installed.
  - Install the Transformers and Datasets libraries.
- **Install Model Requirements**:

  ```bash
  pip install transformers datasets
  ```
- **Load the Model**:

  ```python
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

  model_name = "speech31/wav2vec2-large-english-TIMIT-phoneme_v3"
  model = Wav2Vec2ForCTC.from_pretrained(model_name)
  tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_name)
  ```
- **Inference**: use the model and tokenizer to perform inference on audio files; a minimal sketch follows this list.
- **Cloud GPUs**: for efficient training and inference, consider cloud GPU providers such as AWS, GCP, or Azure.
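Below is a minimal inference sketch. It assumes a mono audio file at the placeholder path `speech.wav` and uses `torchaudio` for loading, which is one option among several; `model` and `tokenizer` come from the loading step above. Note that `Wav2Vec2Tokenizer` is deprecated in recent Transformers releases in favor of `Wav2Vec2Processor`.

```python
import torch
import torchaudio

# Load a mono audio file; "speech.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("speech.wav")

# wav2vec2 models are trained on 16 kHz audio, so resample if needed.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# The (deprecated) Wav2Vec2Tokenizer accepts raw waveforms and returns
# padded input_values ready for the model.
inputs = tokenizer(waveform.squeeze().numpy(), return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token per frame, then
# collapse repeats and blanks during decoding.
predicted_ids = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(predicted_ids))
```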
## License
This model is licensed under the Apache 2.0 License.