wav2vec2-xlsr-multilingual-56

voidful

Introduction

The WAV2VEC2-XLSR-MULTILINGUAL-56 model is a multilingual automatic speech recognition model developed by Voidful. It is designed to handle speech recognition tasks across 56 languages using a single model, leveraging the capabilities of the wav2vec2 framework. The model is provided under the Apache-2.0 license.
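
As a quick illustration of single-model, multilingual use, the snippet below runs the checkpoint through the transformers pipeline API. This is a minimal sketch only (the audio path is a placeholder); the detailed, step-by-step setup appears in the local guide further down.

    # Minimal sketch: one pipeline call covers any of the 56 supported languages.
    from transformers import pipeline
    
    asr = pipeline("automatic-speech-recognition", model="voidful/wav2vec2-xlsr-multilingual-56")
    print(asr("path/to/audio/file.wav"))  # placeholder path; speech should be sampled at 16 kHz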

Architecture

The model is based on the wav2vec2 architecture, fine-tuned to work with 56 languages using the Common Voice dataset. It utilizes a convolutional neural network (CNN) for feature extraction and a transformer architecture for contextual representation.
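
To make this split concrete, the sketch below (an illustration, not part of the original card) loads the checkpoint and prints its two main stages, the convolutional feature extractor and the transformer encoder:

    # Illustrative only: inspect the two main stages of the wav2vec2 model.
    from transformers import Wav2Vec2ForCTC
    
    model = Wav2Vec2ForCTC.from_pretrained("voidful/wav2vec2-xlsr-multilingual-56")
    print(model.wav2vec2.feature_extractor)  # stack of 1-D convolutions over raw audio
    print(model.wav2vec2.encoder)            # transformer layers building contextual representations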

Training

Training Data

The model is fine-tuned on the Common Voice dataset, which includes audio data from 56 different languages. The base model used for fine-tuning is facebook/wav2vec2-large-xlsr-53.
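
For readers who want to inspect the data itself, the sketch below loads one Common Voice test split with the datasets library. The "common_voice" identifier and the "zh-TW" configuration are illustrative assumptions, and recent datasets releases may require one of the mozilla-foundation/common_voice_* variants instead.

    # Illustrative only: pull a single Common Voice language split for inspection.
    from datasets import load_dataset
    
    cv_test = load_dataset("common_voice", "zh-TW", split="test")
    print(cv_test[0]["sentence"])  # reference transcript
    print(cv_test[0]["path"])      # path to the audio clip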

Training Procedure

  • Preprocessing: Speech input must be sampled at 16 kHz.
  • Model Evaluation: The model's performance is measured using Character Error Rate (CER) and Word Error Rate (WER) across various languages.
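
As a minimal sketch of how such scores could be reproduced for a single utterance, the example below uses the jiwer library; jiwer is an assumed helper here, not something the model card prescribes.

    # Hypothetical evaluation sketch using jiwer (an assumed dependency).
    from jiwer import wer, cer
    
    reference = "hello world"   # ground-truth transcript
    hypothesis = "helo world"   # model prediction
    
    print("WER:", wer(reference, hypothesis))  # word error rate
    print("CER:", cer(reference, hypothesis))  # character error rate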

Guide: Running Locally

To run the model locally, follow these steps:

  1. Environment Setup

    !pip install torchaudio
    !pip install datasets transformers
    !pip install asrp
    !wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
    
  2. Model Inference

    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    model_name = "voidful/wav2vec2-xlsr-multilingual-56"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # Load the fine-tuned CTC model and its processor (feature extractor + tokenizer).
    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    
    def load_file_to_data(file, sampling_rate=16_000):
        # Load the audio file and resample it to 16 kHz, the rate the model expects.
        speech, sr = torchaudio.load(file)
        if sr != sampling_rate:
            speech = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sampling_rate)(speech)
        return {"speech": speech.squeeze(0).numpy(), "sampling_rate": sampling_rate}
    
    def predict(data):
        # Convert the raw waveform to model inputs and run greedy CTC decoding.
        features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
        input_values = features.input_values.to(device)
        with torch.no_grad():
            logits = model(input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        return processor.decode(pred_ids[0])
    
    data = load_file_to_data('path/to/audio/file.wav')
    print(predict(data))
    
  3. Hardware Recommendations: A GPU, such as a cloud instance on AWS, Google Cloud, or Azure, is recommended to speed up processing, especially for large datasets or real-time inference; a batched-inference sketch follows this list.
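
If a GPU is available, batching several utterances into one forward pass is a simple way to use it efficiently. The sketch below reuses the processor, model, device, and load_file_to_data from step 2 and assumes `files` is a hypothetical list of local audio paths:

    # Illustrative batching sketch; `files` is a placeholder list of audio paths.
    files = ["a.wav", "b.wav", "c.wav"]
    batch = [load_file_to_data(f)["speech"] for f in files]
    
    features = processor(batch, sampling_rate=16_000, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(features.input_values.to(device)).logits
    print(processor.batch_decode(torch.argmax(logits, dim=-1)))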

License

The model is licensed under the Apache-2.0 License, allowing for both personal and commercial use with adherence to the license terms.
