indicwav2vec hindi
ai4bharat
Introduction
The indicwav2vec-hindi model is an automatic speech recognition (ASR) model for Hindi, built on the Wav2Vec2 architecture. It is developed by AI4Bharat and is available on Hugging Face. The model was trained with fairseq, is compatible with PyTorch, and is licensed under the Apache License 2.0.
Architecture
The model is based on the Wav2Vec2 architecture for automatic speech recognition. Convolutional layers first turn the raw audio waveform into a sequence of frame-level features, and transformer layers then contextualize those features so that the model can emit text for each frame. This checkpoint is tuned for Hindi.
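To make the convolutional front end more concrete, the sketch below estimates how many feature frames the encoder produces for a given number of audio samples. It assumes the default Wav2Vec2 feature-encoder layout (seven 1-D convolutions with the kernel sizes and strides listed in the code); this checkpoint's own configuration is not quoted in this document and may differ, so treat the numbers as illustrative. The names CONV_LAYERS and num_output_frames are hypothetical helpers, not part of any library.

```python
# Assumption: default Wav2Vec2 feature-encoder layout, as (kernel, stride) per layer.
# The actual checkpoint's config may use different values.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples, layers=CONV_LAYERS):
    """Return the number of frames left after the unpadded convolution stack."""
    length = num_samples
    for kernel, stride in layers:
        # Standard output-length formula for a convolution without padding.
        length = (length - kernel) // stride + 1
    return length


SAMPLE_RATE = 16_000  # the model expects 16 kHz audio
frames_per_second = num_output_frames(SAMPLE_RATE)
print(f"{SAMPLE_RATE} samples -> {frames_per_second} frames")
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, so each frame covers roughly 20 ms of speech. The transformer layers and the CTC output head operate on this shorter sequence rather than on raw samples.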
Training
The model was trained with the fairseq library and then ported to the Hugging Face ecosystem. It was trained on Hindi speech data; details of the training setup and datasets can be found in the IndicWav2Vec GitHub repository. Note that the model does not support inference with a language model.
Guide: Running Locally
To run the indicwav2vec-hindi model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch and the transformers, datasets, and torchaudio libraries installed.
- Load Dataset: Use the datasets library to stream a Hindi language dataset, such as "common_voice".
- Resample Audio: Resample the audio to 16 kHz using torchaudio.
- Load Model and Processor:
  - Load the model using AutoModelForCTC.
  - Load the processor using AutoProcessor.
- Inference:
  - Pass the resampled audio through the processor to obtain input values.
  - Run the model in a no-grad context, which skips gradient tracking and reduces memory use.
  - Decode the predictions to obtain the transcription.
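import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

DEVICE_ID = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID = "ai4bharat/indicwav2vec-hindi"
TARGET_SAMPLE_RATE = 16000  # the model expects 16 kHz audio

# Stream a single Hindi sample instead of downloading the whole dataset.
sample = next(iter(load_dataset("common_voice", "hi", split="test", streaming=True)))

# Resample from the rate reported by the dataset (48 kHz for Common Voice) to 16 kHz.
source_sample_rate = sample["audio"]["sampling_rate"]
waveform = torch.tensor(sample["audio"]["array"])
resampled_audio = F.resample(waveform, source_sample_rate, TARGET_SAMPLE_RATE).numpy()

model = AutoModelForCTC.from_pretrained(MODEL_ID).to(DEVICE_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Convert the waveform into model inputs.
input_values = processor(resampled_audio, return_tensors="pt").input_values

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(input_values.to(DEVICE_ID)).logits.cpu()

# Greedy decoding: take the most likely token per frame, then let the processor collapse them.
prediction_ids = torch.argmax(logits, dim=-1)
output_str = processor.batch_decode(prediction_ids)[0]
print(f"Greedy Decoding: {output_str}")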
For faster inference, consider running the model on a GPU, for example through a cloud service such as AWS, Google Cloud, or Azure.
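The final decoding step in the script above relies on the CTC convention: the per-frame argmax IDs are collapsed by merging consecutive repeats and then dropping the blank token. The sketch below illustrates that rule in plain Python. The function ctc_greedy_collapse, the toy vocabulary, and the choice of ID 0 as the blank are assumptions made for illustration; they are not taken from the model's real tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs into a label sequence using the CTC rule."""
    # Step 1: merge runs of identical consecutive IDs into a single ID.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates labels.
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary; the real model's vocabulary is different.
vocab = {0: "<blank>", 1: "न", 2: "म", 3: "स", 4: "्", 5: "त", 6: "े"}

# Hypothetical per-frame predictions, as torch.argmax(logits, dim=-1) might return.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 5, 0, 6, 6]

label_ids = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in label_ids)
print(text)
```

A blank between two identical IDs keeps both of them, which is how CTC represents doubled characters: [1, 1, 0, 1] collapses to [1, 1], not [1].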
License
The indicwav2vec-hindi model is distributed under the Apache License 2.0, which permits personal and commercial use, modification, and distribution, provided that the license and attribution notices are retained.