wav2vec2-large-xlsr-53-polish

jonatasgrosman

Introduction

The wav2vec2-large-xlsr-53-polish model is a fine-tuned version of the facebook/wav2vec2-large-xlsr-53 model for Automatic Speech Recognition (ASR) in Polish. It was fine-tuned on the Polish subset of the Common Voice dataset.

Architecture

This model uses the Wav2Vec2 architecture with a CTC output head (loaded as Wav2Vec2ForCTC). It runs on PyTorch through the transformers library and expects audio input sampled at 16 kHz.
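
To make the 16 kHz input requirement concrete, the sketch below estimates how many output frames the convolutional feature encoder produces for a clip of a given length. The kernel and stride values are assumed to match the default Wav2Vec2Config in transformers, and the function name num_output_frames is hypothetical.

```python
# Assumed defaults of Wav2Vec2Config (conv_kernel / conv_stride); not read from this model's config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Apply the unpadded 1-D convolution length formula once per layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio (16,000 samples)
print(num_output_frames(16_000))  # → 49
```

Under these assumptions each output frame advances by 320 samples, or about 20 ms of audio, which is why feeding audio at a different sampling rate would distort the time scale the model was trained on.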

Training

The model was fine-tuned on the Polish subset of Common Voice 6.1, using GPU credits provided by OVHcloud. The training script is available on GitHub for reproducibility and further experimentation.

Guide: Running Locally

Basic Steps

  1. Installation: Ensure Python is installed, then install the torch, transformers, datasets, and librosa libraries with pip:

    pip install torch transformers datasets librosa
    
  2. Download and Load Model:

    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    
  3. Preprocess Audio Data: Use librosa to load each audio file and resample it to 16 kHz.

    import librosa
    
    def speech_file_to_array_fn(path):
        speech_array, sampling_rate = librosa.load(path, sr=16_000)
        return speech_array
    
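  4. Inference: Load the audio files as waveforms and transcribe them.

    import torch
    
    audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
    # The processor expects waveforms (arrays), not file paths, so load each file first
    speech_arrays = [speech_file_to_array_fn(path) for path in audio_paths]
    inputs = processor(speech_arrays, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    # Greedy decoding: take the most likely token per frame, then decode to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcriptions = processor.batch_decode(predicted_ids)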
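
The last two lines of the inference step pick the most likely token for each frame and then turn those ids into text. As a rough illustration of what that decoding involves, here is a minimal pure-Python sketch of greedy CTC decoding. The function name, the toy vocabulary, and the blank id of 0 are all assumptions made for illustration; the real vocabulary and blank token come from the model's tokenizer, which also handles word delimiters that this sketch ignores.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame token ids into text: merge repeats, then drop blanks."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank token
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; not the model's actual vocabulary.
toy_vocab = {0: "<pad>", 1: "t", 2: "a", 3: "k"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, toy_vocab))  # → tak
```

A blank between two identical ids keeps both characters, so the sequence [1, 0, 1] decodes to "tt" rather than "t".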
    

Cloud GPUs

For enhanced performance, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or OVHcloud, which offer dedicated resources for intensive computational tasks like model training and inference.
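
Whichever provider is used, the code has to place the model and inputs on the GPU explicitly. The sketch below shows one way to choose a device, falling back to the CPU when PyTorch or a GPU is unavailable. The helper name pick_device is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to move to a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With the model and inputs from the guide above, one would then call:
#   model.to(device)
#   inputs = inputs.to(device)
```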

License

This model is released under the Apache 2.0 License, allowing for both personal and commercial use, modification, and distribution. Ensure compliance with the license terms when using the model.
