indicwav2vec hindi

ai4bharat

Introduction

The indicwav2vec-hindi model is an Automatic Speech Recognition (ASR) model for Hindi built on the Wav2Vec2 architecture. It was developed by AI4Bharat and is available on Hugging Face. The model was trained with fairseq, runs on PyTorch, and is licensed under the Apache License 2.0.

Architecture

The model is based on the Wav2Vec2 architecture, which performs speech recognition directly on raw audio waveforms. A stack of convolutional layers first turns the waveform into a sequence of latent frames, transformer layers then contextualize those frames, and a CTC output layer maps each frame to a token in the Hindi vocabulary, from which the text is decoded.
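To make the frame-level view concrete, the sketch below estimates how many output frames the convolutional front end produces for a given number of audio samples. The kernel sizes and strides are the standard Wav2Vec2 base defaults and are an assumption here; the checkpoint's own configuration is authoritative. The helper name `num_output_frames` is hypothetical.

```python
# Assumed Wav2Vec2 base defaults for the convolutional feature encoder.
# Check the checkpoint's config (conv_kernel, conv_stride) before relying on them.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return the number of frames left after all (unpadded) conv layers."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output-length formula for a 1-D convolution without padding.
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
print(num_output_frames(16_000))  # 49
```

Under these assumptions the model emits roughly one frame every 20 ms, which is the time resolution at which the CTC layer predicts tokens.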

Training

The model was trained with the fairseq library and later ported to the Hugging Face ecosystem. Details of the training setup and the datasets used are available in the IndicWav2Vec GitHub repository. Note that this model does not support inference with a language model, so the example in this guide uses greedy decoding only.

Guide: Running Locally

To run the indicwav2vec-hindi model locally, follow these steps:

  1. Install Dependencies: Ensure that PyTorch, torchaudio, transformers, and datasets are installed.
  2. Load Dataset: Use the datasets library to stream a Hindi language dataset, such as "common_voice".
  3. Resample Audio: Adjust the audio sampling rate to 16 kHz using torchaudio.
  4. Load Model and Processor:
    • Load the model using AutoModelForCTC.
    • Load the processor using AutoProcessor.
  5. Inference:
    • Process the resampled audio through the processor to obtain input values.
    • Run the model inside a torch.no_grad() context so that no gradients are tracked, which saves memory and time during inference.
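    • Decode the predictions to obtain the transcription.

import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

DEVICE_ID = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID = "ai4bharat/indicwav2vec-hindi"
TARGET_SR = 16_000  # sampling rate the model expects

# Stream a single Hindi test sample instead of downloading the whole dataset.
sample = next(iter(load_dataset("common_voice", "hi", split="test", streaming=True)))

# Resample from the clip's own sampling rate (48 kHz in the original script) to 16 kHz.
audio = torch.tensor(sample["audio"]["array"], dtype=torch.float32)
resampled_audio = F.resample(audio, sample["audio"]["sampling_rate"], TARGET_SR).numpy()

model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Passing sampling_rate lets the processor verify the input rate.
input_values = processor(resampled_audio, sampling_rate=TARGET_SR, return_tensors="pt").input_values

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(input_values.to(DEVICE_ID)).logits.cpu()

# Greedy decoding: pick the most likely token per frame, then decode to text.
prediction_ids = torch.argmax(logits, dim=-1)
output_str = processor.batch_decode(prediction_ids)[0]
print(f"Greedy Decoding: {output_str}")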

The script uses a CUDA GPU when one is available and falls back to the CPU otherwise. For faster inference, consider a cloud GPU service such as AWS, Google Cloud, or Azure.
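The final decoding step of the script above, where `processor.batch_decode` turns `prediction_ids` into text, can be illustrated with a minimal sketch of greedy CTC decoding: collapse consecutive repeated ids, then drop the blank token. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_collapse` are hypothetical and chosen only for illustration; the real tokenizer defines its own vocabulary and handles special tokens itself.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; id 0 is assumed to be the CTC blank token.
BLANK_ID = 0
VOCAB: Dict[int, str] = {1: "न", 2: "म", 3: "स", 4: "्", 5: "त", 6: "े"}


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = BLANK_ID) -> List[int]:
    """Collapse repeated ids, then remove blanks (greedy CTC post-processing)."""
    collapsed: List[int] = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one clip.
frame_ids = [1, 1, 0, 2, 0, 3, 3, 4, 5, 5, 0, 6]
token_ids = ctc_greedy_collapse(frame_ids)
text = "".join(VOCAB[i] for i in token_ids)
print(token_ids, text)
```

A blank between two identical ids keeps both, which is how CTC represents genuinely doubled characters.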

License

The indicwav2vec-hindi model is distributed under the Apache License 2.0, which permits commercial use, modification, distribution, and private use, provided that the license text and the copyright and attribution notices are retained.
