WAV2VEC2-XLS-R-300M-PHONEME

Author: vitouphy

Introduction

WAV2VEC2-XLS-R-300M-PHONEME is a fine-tuned version of the facebook/wav2vec2-xls-r-300m model for automatic speech recognition at the phoneme level. On the evaluation set it achieves a loss of 0.3327 and a character error rate (CER) of 0.1332.
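
CER is the character-level edit distance between a prediction and its reference, divided by the number of characters in the reference. A minimal sketch of computing it with the jiwer package (an assumption; jiwer is not listed in the framework versions below, and the strings are made up):

    import jiwer  # edit-distance metrics; install with: pip install jiwer

    reference = "b æ t"   # hypothetical reference phoneme string
    prediction = "b æ d"  # hypothetical model output

    # CER = (substitutions + insertions + deletions) / characters in the reference
    print(jiwer.cer(reference, prediction))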

Architecture

The model uses the wav2vec2 architecture, implemented in the Transformers library on top of PyTorch. It is available in the Safetensors format and is compatible with Inference Endpoints.
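
The architecture details can be verified from the hosted configuration. A minimal sketch (the printed values come from the configuration on the Hugging Face Hub, not from this card):

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    print(config.model_type)         # "wav2vec2"
    print(config.num_hidden_layers)  # encoder depth of the underlying XLS-R 300M model
    print(config.hidden_size)        # transformer hidden dimension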

Training

Training Procedure

The model was fine-tuned with the following hyperparameters (a TrainingArguments sketch follows the list):

  • Learning Rate: 3e-05
  • Train Batch Size: 8
  • Evaluation Batch Size: 8
  • Gradient Accumulation Steps: 4
  • Total Train Batch Size: 32
  • Optimizer: Adam with betas (0.9, 0.999) and epsilon 1e-08
  • Learning Rate Scheduler Type: Linear
  • Warmup Steps: 2000
  • Training Steps: 7000
  • Mixed Precision Training: Native AMP
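
A minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments. This is an assumption about the setup; the card does not include the actual training script, and the output path is hypothetical:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./wav2vec2-xls-r-300m-phoneme",  # hypothetical output path
        learning_rate=3e-05,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=4,   # effective train batch size of 32 (8 x 4)
        lr_scheduler_type="linear",
        warmup_steps=2000,
        max_steps=7000,
        adam_beta1=0.9,
        adam_beta2=0.999,
        adam_epsilon=1e-08,
        fp16=True,                       # native AMP mixed-precision training
    )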

Training Results

Training progress is documented as follows:

Epoch   Step   Training Loss   Validation Loss   CER
1.32    1000   3.4324          3.3693            0.9091
2.65    2000   2.1751          1.1382            0.2397
3.97    3000   1.3986          0.4886            0.1452
5.30    4000   1.2285          0.3842            0.1351
6.62    5000   1.1420          0.3505            0.1349
7.95    6000   1.1075          0.3323            0.1317
9.27    7000   1.0867          0.3265            0.1315
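
The validation CER above is the kind of metric typically produced by a compute_metrics callback passed to the Trainer. A minimal sketch of such a callback (an assumption; the author's evaluation code is not shown, and the "cer" metric requires the jiwer package):

    import numpy as np
    from datasets import load_metric
    from transformers import Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    cer_metric = load_metric("cer")

    def compute_metrics(pred):
        # Greedy CTC decoding: take the highest-scoring token at every frame.
        pred_ids = np.argmax(pred.predictions, axis=-1)
        # Labels use -100 for padding; swap it back to the tokenizer's pad token id.
        pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
        pred_str = processor.batch_decode(pred_ids)
        label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
        return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}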

Framework Versions

  • Transformers: 4.17.0.dev0
  • PyTorch: 1.10.2+cu102
  • Datasets: 1.18.2.dev0
  • Tokenizers: 0.11.0

Guide: Running Locally

To run this model locally, follow these steps:

  1. Install Dependencies:
    Ensure Python and PyTorch are installed. Use pip to install Transformers and other required libraries.

    pip install torch transformers datasets
    
  2. Load the Model:
    Use the Transformers library to load the model.

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    
  3. Inference:
    Prepare the audio, run it through the processor and model, and decode the output; a sketch follows this guide.
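
The following is a minimal inference sketch, not the author's own example. It assumes a local audio file named audio.wav (hypothetical) and uses librosa, which is not in the dependency list above, for loading and resampling:

    import torch
    import librosa
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")
    processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-phoneme")

    # Load the waveform and resample to the 16 kHz rate expected by wav2vec2 models.
    speech, _ = librosa.load("audio.wav", sr=16000)

    # Convert the raw waveform into model inputs.
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

    # Forward pass and greedy CTC decoding (argmax at each frame).
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # Map token ids back to a phoneme string.
    print(processor.batch_decode(predicted_ids)[0])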
    

Suggestion: Cloud GPUs

For faster inference and fine-tuning, consider cloud GPU offerings from providers such as AWS, Google Cloud Platform, or Azure.

License

The model is licensed under the Apache 2.0 License.
