wav2vec2-xlsr-multilingual-56

voidful

Introduction

The WAV2VEC2-XLSR-MULTILINGUAL-56 model is a multilingual automatic speech recognition model developed by Voidful. It is designed to handle speech recognition tasks across 56 languages using a single model, leveraging the capabilities of the wav2vec2 framework. The model is provided under the Apache-2.0 license.
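
As a quick illustration of single-model, multilingual use, the snippet below runs the checkpoint through the transformers pipeline API. This is a minimal sketch only (the audio path is a placeholder); the detailed, step-by-step setup appears in the local guide further down.

    # Minimal sketch: one pipeline call covers any of the 56 supported languages.
    from transformers import pipeline
    
    asr = pipeline("automatic-speech-recognition", model="voidful/wav2vec2-xlsr-multilingual-56")
    print(asr("path/to/audio/file.wav"))  # placeholder path; speech should be sampled at 16 kHz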

Architecture

The model is based on the wav2vec2 architecture, fine-tuned to work with 56 languages using the Common Voice dataset. It utilizes a convolutional neural network (CNN) for feature extraction and a transformer architecture for contextual representation.
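
To make this split concrete, the sketch below (an illustration, not part of the original card) loads the checkpoint and prints its two main stages, the convolutional feature extractor and the transformer encoder:

    # Illustrative only: inspect the two main stages of the wav2vec2 model.
    from transformers import Wav2Vec2ForCTC
    
    model = Wav2Vec2ForCTC.from_pretrained("voidful/wav2vec2-xlsr-multilingual-56")
    print(model.wav2vec2.feature_extractor)  # stack of 1-D convolutions over raw audio
    print(model.wav2vec2.encoder)            # transformer layers building contextual representations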

Training

Training Data

The model is fine-tuned on the Common Voice dataset, which includes audio data from 56 different languages. The base model used for fine-tuning is facebook/wav2vec2-large-xlsr-53.
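
For readers who want to inspect the data itself, the sketch below loads one Common Voice test split with the datasets library. The "common_voice" identifier and the "zh-TW" configuration are illustrative assumptions, and recent datasets releases may require one of the mozilla-foundation/common_voice_* variants instead.

    # Illustrative only: pull a single Common Voice language split for inspection.
    from datasets import load_dataset
    
    cv_test = load_dataset("common_voice", "zh-TW", split="test")
    print(cv_test[0]["sentence"])  # reference transcript
    print(cv_test[0]["path"])      # path to the audio clip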

Training Procedure

  • Preprocessing: Speech input must be sampled at 16 kHz.
  • Model Evaluation: The model's performance is measured using Character Error Rate (CER) and Word Error Rate (WER) across various languages.
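
As a minimal sketch of how such scores could be reproduced for a single utterance, the example below uses the jiwer library; jiwer is an assumed helper here, not something the model card prescribes.

    # Hypothetical evaluation sketch using jiwer (an assumed dependency).
    from jiwer import wer, cer
    
    reference = "hello world"   # ground-truth transcript
    hypothesis = "helo world"   # model prediction
    
    print("WER:", wer(reference, hypothesis))  # word error rate
    print("CER:", cer(reference, hypothesis))  # character error rate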

Guide: Running Locally

To run the model locally, follow these steps:

  1. Environment Setup

    !pip install torchaudio
    !pip install datasets transformers
    !pip install asrp
    !wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
    
  2. Model Inference

    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    
    model_name = "voidful/wav2vec2-xlsr-multilingual-56"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    # Load the fine-tuned CTC model and its processor (feature extractor + tokenizer).
    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    
    def load_file_to_data(file, sampling_rate=16_000):
        # Load the audio file and resample it to 16 kHz, the rate the model expects.
        speech, sr = torchaudio.load(file)
        if sr != sampling_rate:
            speech = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sampling_rate)(speech)
        return {"speech": speech.squeeze(0).numpy(), "sampling_rate": sampling_rate}
    
    def predict(data):
        # Convert the raw waveform to model inputs and run greedy CTC decoding.
        features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
        input_values = features.input_values.to(device)
        with torch.no_grad():
            logits = model(input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        return processor.decode(pred_ids[0])
    
    data = load_file_to_data('path/to/audio/file.wav')
    print(predict(data))
    
  3. Hardware Recommendations: A GPU, such as a cloud instance on AWS, Google Cloud, or Azure, is recommended to speed up processing, especially for large datasets or real-time inference; a batched-inference sketch follows this list.
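
If a GPU is available, batching several utterances into one forward pass is a simple way to use it efficiently. The sketch below reuses the processor, model, device, and load_file_to_data from step 2 and assumes `files` is a hypothetical list of local audio paths:

    # Illustrative batching sketch; `files` is a placeholder list of audio paths.
    files = ["a.wav", "b.wav", "c.wav"]
    batch = [load_file_to_data(f)["speech"] for f in files]
    
    features = processor(batch, sampling_rate=16_000, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(features.input_values.to(device)).logits
    print(processor.batch_decode(torch.argmax(logits, dim=-1)))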

License

The model is licensed under the Apache-2.0 License, allowing for both personal and commercial use with adherence to the license terms.
