romanian wav2vec2

gigant

Introduction

The Romanian Wav2Vec2 model is an automatic speech recognition (ASR) model for the Romanian language. It is fine-tuned from the facebook/wav2vec2-xls-r-300m checkpoint on the Common Voice 8.0 (Romanian subset) and Romanian Speech Synthesis datasets. It transcribes Romanian audio into text, and its decoding is augmented with a 5-gram language model.

Architecture

This model is built on facebook/wav2vec2-xls-r-300m with a CTC (connectionist temporal classification) head for speech recognition. Decoding uses a 5-gram language model, built with kenlm and applied through pyctcdecode; the language model was trained on the Romanian Corpora Parliament dataset. The language model scores candidate transcriptions during decoding, which typically improves accuracy over decoding from the acoustic output alone.
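
To make the role of the CTC head more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, and drop the blank token. This is a simplified illustration only, not the language-model beam search that pyctcdecode performs for this model. The toy vocabulary and the `blank_id=0` convention are assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated per-frame predictions, then remove blank tokens."""
    # groupby merges runs of identical consecutive ids into one entry
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; the real model's vocabulary is different.
vocab = {0: "<blank>", 1: "d", 2: "a", 3: " "}

# One predicted token id per audio frame (made-up values).
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]
print(ctc_greedy_decode(frames, vocab))  # prints "da da"
```

A language model improves on this by comparing several candidate transcriptions instead of committing to the single most likely token at every frame.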

Training

The model was trained using the Common Voice 8.0 (Romanian subset) and Romanian Speech Synthesis datasets. Key hyperparameters include:

  • Learning rate: 0.003
  • Train batch size: 16
  • Eval batch size: 8
  • Seed: 42
  • Gradient accumulation steps: 3
  • Total train batch size: 48
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Scheduler type: linear
  • Warmup steps: 500
  • Epochs: 50
  • Mixed precision training: Native AMP
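
As a rough illustration of how some of these values relate, the sketch below computes the total train batch size from the per-step batch size and the gradient accumulation steps, and models a linear schedule with warmup. It assumes a single training device (consistent with 16 × 3 = 48) and a hypothetical total of 10,000 optimizer steps, which the source does not state. The function names are illustrative and not part of any library.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Samples contributing to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step, total_steps, peak_lr=0.003, warmup_steps=500):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(16, 3))  # prints 48

# total_steps=10_000 is a hypothetical value chosen for illustration.
for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_schedule_lr(step, total_steps=10_000))
```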

Both word error rate (WER) and character error rate (CER) improve substantially over the course of training.
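
For readers unfamiliar with these metrics, the following sketch shows one common way to compute them: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and CER is the same at the character level. This is a plain-Python illustration that assumes a non-empty reference; it is not necessarily the evaluation code used for this model, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example: the hypothesis drops one diacritic.
reference = "bună ziua tuturor"
hypothesis = "buna ziua tuturor"
print(wer(reference, hypothesis))  # one wrong word out of three
print(cer(reference, hypothesis))  # one wrong character out of seventeen
```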

Guide: Running Locally

To run this model locally:

  1. Install dependencies:

    pip install transformers torch torchaudio pyctcdecode https://github.com/kpu/kenlm/archive/master.zip
    
  2. Load the model:

    from transformers import AutoProcessor, AutoModelForCTC
    
    processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")
    model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")
    
  3. Use the ASR pipeline:

    from transformers import pipeline
    
    asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")
    
  4. Resample audio if needed:

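    import torch
    import torchaudio

    # `sample` is assumed to be one row of a Hugging Face audio dataset
    audio = sample["audio"]["array"]
    rate = sample["audio"]["sampling_rate"]
    # Wav2Vec2 models expect 16 kHz input
    resampler = torchaudio.transforms.Resample(rate, 16000)
    audio_16 = resampler(torch.Tensor(audio)).numpy()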
    
  5. Predict text from audio:

    # The pipeline returns a dict; the transcription is under the "text" key
    predicted_text = asr(audio_16)["text"]
    

For faster inference, consider running the model on a GPU, for example through a cloud provider such as AWS, Google Cloud, or Azure.

License

This model is distributed under the Apache-2.0 license, allowing for both personal and commercial use with proper attribution.
