Wav2Vec2-Large-XLSR-Indonesian

indonesian-nlp

Introduction

The Wav2Vec2-Large-XLSR-Indonesian model performs automatic speech recognition for the Indonesian language. It is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 trained on the Indonesian Common Voice dataset, and it expects speech input sampled at 16 kHz.

Architecture

The model is built on the Wav2Vec2 framework: a convolutional feature encoder followed by a transformer network maps raw audio into contextual representations, which a CTC head decodes into text. Starting from the multilingual XLSR-53 checkpoint, the model was fine-tuned specifically for Indonesian on the Common Voice dataset.
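As a quick check, the key hyperparameters can be read directly off the checkpoint's configuration. This is a minimal sketch using the standard transformers config attributes; the values in the comments are those of the stock wav2vec2-large architecture:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")

# Convolutional feature encoder that turns the raw waveform into latent frames
print(model.config.num_feat_extract_layers)  # 7 conv layers in wav2vec2-large
# Transformer encoder stacked on top of the convolutional features
print(model.config.num_hidden_layers)        # 24 layers in wav2vec2-large
print(model.config.hidden_size)              # hidden size 1024 in wav2vec2-large
# Output vocabulary of the CTC head (Indonesian characters plus special tokens)
print(model.config.vocab_size)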

Training

The model was trained on the Common Voice Indonesian training and validation splits together with synthetic voice datasets. The training script is available on GitHub for those interested in replicating or building upon this work.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Libraries: Ensure you have torch, torchaudio, datasets, and transformers installed (an install command follows this list).

  2. Load the Dataset: Use the datasets library to load the Indonesian Common Voice test dataset.

  3. Preprocess Input Data: Convert audio files to the required format and sample rate using torchaudio.

  4. Load the Model and Processor: Retrieve the pre-trained model and processor from the Hugging Face model hub.

  5. Run Inference: Use the model for inference by processing the input speech data and decoding the predictions.

  6. Evaluate: Optionally, evaluate the model's performance using the Word Error Rate (WER) metric.
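The libraries from step 1 can be installed with pip (version pins omitted; any reasonably recent releases should work):

pip install torch torchaudio datasets transformers

The snippet below then covers steps 2 through 5 end to end: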

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the first 2% of the Indonesian Common Voice test split
test_dataset = load_dataset("common_voice", "id", split="test[:2%]")

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")

# Preprocess: decode each audio file and resample it to the 16 kHz rate the model expects
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batch the first two examples, padding them to a common length
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at each time step
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))

Cloud GPUs

For faster inference, especially on large batches, consider using cloud GPU services such as AWS, GCP, or Azure.
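On a machine with a CUDA device, only the model and input tensors need to be moved to the GPU; a minimal sketch reusing model and inputs from the snippet above:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

with torch.no_grad():
    logits = model(
        inputs.input_values.to(device),
        attention_mask=inputs.attention_mask.to(device),
    ).logits

predicted_ids = torch.argmax(logits, dim=-1)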

License

This model is distributed under the Apache-2.0 license, allowing for both personal and commercial use, provided that the conditions of the license are met.
