wav2vec2-large-xlsr-53-polish Model

Introduction

The wav2vec2-large-xlsr-53-polish model is a fine-tuned version of the facebook/wav2vec2-large-xlsr-53 model for Automatic Speech Recognition (ASR) in Polish. It was fine-tuned on the Polish subset of the Common Voice dataset.

Architecture

This model uses the Wav2Vec2 architecture and is implemented in PyTorch through the transformers library. It expects speech input sampled at 16 kHz, so audio recorded at a different rate should be resampled before it is passed to the model.
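To make the 16 kHz requirement more concrete, the sketch below estimates how many feature frames the Wav2Vec2 convolutional front end would produce for a clip of a given length. The kernel and stride values are an assumption: they are the defaults of the standard Wav2Vec2 configuration rather than values stated here, and the helper name `num_output_frames` is hypothetical.

```python
# Assumed defaults of the standard Wav2Vec2 feature encoder
# (not stated in this document).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_output_frames(num_samples: int) -> int:
    """Apply the unpadded 1-D convolution length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


frames_per_second = num_output_frames(SAMPLE_RATE)  # one second of audio
print(frames_per_second)  # 49, i.e. roughly one frame every 20 ms
```

Under these assumptions each frame covers about 20 ms of speech, which only holds when the input really is sampled at 16 kHz. Audio at another rate would be read at the wrong time scale.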

Training

The model was fine-tuned on the Polish dataset from Common Voice 6.1, using GPU credits provided by OVHcloud. The training script is available on GitHub for reproducibility and further experimentation.

Guide: Running Locally

Basic Steps

Installation: Ensure you have Python installed. The steps below use torch, transformers, and librosa; datasets is optional and only needed if you want to load Common Voice. You can install all of them using pip:
```
pip install torch transformers datasets librosa
```

Download and Load Model:

```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
```

Preprocess Audio Data: Use librosa to load each audio file and resample it to 16 kHz.

```
import librosa

def speech_file_to_array_fn(path):
    # sr=16_000 makes librosa resample to the rate the model expects.
    speech_array, _ = librosa.load(path, sr=16_000)
    return speech_array
```
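Clips loaded this way usually differ in length, so they have to be padded to a common length before they can be batched. The sketch below illustrates the idea behind padding with an attention mask in plain NumPy. `pad_batch` is a hypothetical helper written for illustration, not the processor's actual implementation.

```python
import numpy as np


def pad_batch(arrays):
    """Right-pad 1-D float arrays with zeros and build a 0/1 attention mask."""
    max_len = max(len(a) for a in arrays)
    values = np.zeros((len(arrays), max_len), dtype=np.float32)
    mask = np.zeros((len(arrays), max_len), dtype=np.int64)
    for i, a in enumerate(arrays):
        values[i, : len(a)] = a
        mask[i, : len(a)] = 1  # 1 marks real samples, 0 marks padding
    return values, mask


# Two synthetic "clips" of different lengths stand in for loaded audio.
clips = [np.ones(5, dtype=np.float32), np.ones(3, dtype=np.float32)]
values, mask = pad_batch(clips)
print(values.shape)   # (2, 5)
print(mask.tolist())  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The mask lets a model tell real samples from padding, which is why an attention mask is typically passed alongside the padded input values.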

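Inference: Load each file into an array, then transcribe the batch.

```
import torch

audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

# The processor expects raw waveforms, not file paths, so load the audio first.
speech_arrays = [speech_file_to_array_fn(path) for path in audio_paths]
inputs = processor(speech_arrays, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token per frame, then decode the ids to text.
predicted_ids = torch.argmax(logits, dim=-1)
transcriptions = processor.batch_decode(predicted_ids)
```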
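Taking the argmax and then calling `batch_decode` amounts to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, and drop the blank token. The sketch below shows that collapse step on made-up data. The vocabulary, the blank index, and `ctc_greedy_decode` are hypothetical and do not reflect the tokenizer's real vocabulary or code.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "c", "z", "e", "ś", "ć"]
BLANK_ID = 0


def ctc_greedy_decode(ids):
    """Collapse consecutive repeats, then drop blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)


# Per-frame argmax ids for a short utterance (made-up values).
frame_ids = [1, 1, 0, 2, 2, 2, 3, 0, 4, 4, 0, 5]
print(ctc_greedy_decode(frame_ids))  # cześć
```

The blank token is what allows genuinely doubled letters to survive: `[1, 0, 1]` decodes to "cc", whereas `[1, 1]` collapses to a single "c".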

Cloud GPUs

For enhanced performance, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or OVHcloud, which offer dedicated resources for intensive computational tasks like model training and inference.

License

This model is released under the Apache 2.0 License, allowing for both personal and commercial use, modification, and distribution. Ensure compliance with the license terms when using the model.
