romanian wav2vec2
gigant
Introduction
The Romanian Wav2Vec2 model is an automatic speech recognition (ASR) model fine-tuned for the Romanian language. It is based on the pretrained facebook/wav2vec2-xls-r-300m model and was fine-tuned on the Common Voice 8.0 and Romanian Speech Synthesis datasets. It converts Romanian audio input into text, and its decoding is augmented with a 5-gram language model.
Architecture
This model is built upon facebook/wav2vec2-xls-r-300m, with a CTC (connectionist temporal classification) head added for speech recognition. Decoding uses a 5-gram language model, built with kenlm and applied through pyctcdecode, that was trained on the Romanian Corpora Parliament dataset. The language model scores candidate transcriptions during decoding, which typically improves accuracy over decoding from the acoustic model alone.
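To make the role of the CTC head more concrete, here is a minimal sketch of greedy CTC decoding, the baseline that language-model decoding improves on. The vocabulary, blank symbol, and frame scores below are made up for illustration and are not taken from the model's tokenizer; pyctcdecode replaces this argmax step with a beam search that is additionally scored by the n-gram model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; the real tokenizer defines its own


def greedy_ctc_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring symbol for each audio frame.
    best = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # Merge consecutive duplicates, then remove the blank symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Toy example: 5 frames over a 4-symbol vocabulary (illustrative values only).
vocab = ["_", "d", "a", " "]
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],  # d
    [0.10, 0.70, 0.10, 0.10],  # d (repeat, merged)
    [0.90, 0.03, 0.03, 0.04],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.20, 0.10, 0.60, 0.10],  # a (repeat, merged)
]
print(greedy_ctc_decode(frame_scores, vocab))  # da
```

Greedy decoding looks at each frame in isolation, so it cannot prefer a plausible word over an acoustically similar but implausible one. That is the gap an n-gram language model is meant to close.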
Training
The model was trained using the Common Voice 8.0 (Romanian subset) and Romanian Speech Synthesis datasets. Key hyperparameters include:
- Learning rate: 0.003
- Train batch size: 16
- Eval batch size: 8
- Seed: 42
- Gradient accumulation steps: 3
- Total train batch size: 48
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- Scheduler type: linear
- Warmup steps: 500
- Epochs: 50
- Mixed precision training: Native AMP
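As a rough illustration of how some of these values relate, the sketch below derives the total train batch size from the per-step batch size and the gradient accumulation steps, and computes a linear schedule with warmup. It assumes a single training device (which is consistent with 16 × 3 = 48) and a hypothetical total step count, since the source does not state the number of optimizer steps.

```python
LEARNING_RATE = 0.003
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 3
WARMUP_STEPS = 500


def total_train_batch_size(per_step=TRAIN_BATCH_SIZE,
                           accum_steps=GRAD_ACCUM_STEPS,
                           num_devices=1):
    """Examples consumed per optimizer update (num_devices=1 is an assumption)."""
    return per_step * accum_steps * num_devices


def linear_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical; the real value depends on dataset size

print(total_train_batch_size())  # 48
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

The learning rate peaks at the end of warmup (step 500) and then falls linearly to zero by the final step.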
The model achieves significant improvements in word error rate (WER) and character error rate (CER) over the training period.
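For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are commonly computed: the edit (Levenshtein) distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. This is a plain-Python illustration, not the evaluation script used for this model, and the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "el are un caiet"
hypothesis = "el are un caet"  # one character dropped in the last word
print(wer(reference, hypothesis))  # 0.25
print(cer(reference, hypothesis))
```

A single wrong character costs a whole word in WER but only one character in CER, which is why the two are usually reported together.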
Guide: Running Locally
To run this model locally:
- Install dependencies:
    pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
- Load the model:
    from transformers import AutoProcessor, AutoModelForCTC
    processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")
    model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")
- Use the ASR pipeline:
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")
- Resample audio if needed:
    import torch
    import torchaudio
    # `sample` is one example from an audio dataset
    audio = sample["audio"]["array"]
    rate = sample["audio"]["sampling_rate"]
    # The guide feeds the model 16 kHz audio
    resampler = torchaudio.transforms.Resample(rate, 16000)
    audio_16 = resampler(torch.Tensor(audio)).numpy()
- Predict text from audio:
    # The pipeline returns a dict; the transcription is under the "text" key
    predicted_text = asr(audio_16)["text"]
For optimal performance and speed, consider using cloud GPU resources such as AWS, Google Cloud, or Azure.
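The "resample if needed" step in the guide can be wrapped in a small helper that skips resampling when the audio is already at 16 kHz. This is a sketch only: the name ensure_16k is hypothetical and not part of any library, and torch and torchaudio are imported lazily so the no-op path does not depend on them.

```python
TARGET_RATE = 16_000  # sampling rate used in the guide above


def ensure_16k(audio, rate, target_rate=TARGET_RATE):
    """Return `audio` at `target_rate`, resampling only when the rates differ."""
    if rate == target_rate:
        return audio  # already at the expected rate; nothing to do
    # Lazy imports: only needed when resampling is actually required.
    import torch
    import torchaudio

    resampler = torchaudio.transforms.Resample(rate, target_rate)
    waveform = torch.as_tensor(audio, dtype=torch.float32)
    return resampler(waveform).numpy()
```

With the pipeline from the guide, this could be used as asr(ensure_16k(audio, rate))["text"], where audio and rate come from a dataset sample as shown in the resampling step.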
License
This model is distributed under the Apache-2.0 license, allowing for both personal and commercial use with proper attribution.