wav2vec2 xlsr multilingual 56
voidful
Introduction
The WAV2VEC2-XLSR-MULTILINGUAL-56 model is a multilingual automatic speech recognition model developed by Voidful. It handles speech recognition across 56 languages with a single model built on the wav2vec2 framework. The model is provided under the Apache-2.0 license.
Architecture
The model is based on the wav2vec2 architecture, fine-tuned on the Common Voice dataset to cover 56 languages. It uses a convolutional neural network (CNN) to extract features from the raw waveform and a transformer to build contextual representations from those features.
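To make the two stages concrete, the sketch below estimates how many feature frames the convolutional encoder hands to the transformer for a clip of a given length. The kernel sizes and strides are those of the standard wav2vec2 feature encoder configuration and are an assumption here, since this document does not list them; the function name is illustrative.

```python
# Kernel sizes and strides of the standard wav2vec2 convolutional feature
# encoder (assumed; not stated in this document).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_feature_frames(num_samples: int) -> int:
    """Return how many frames the CNN emits for `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


one_second = 16_000  # samples in one second of 16 kHz audio
print(num_feature_frames(one_second))  # 49
```

Under these assumptions the strides multiply to 320 samples, so each frame seen by the transformer covers roughly 20 ms of 16 kHz audio.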
Training
Training Data
The model is fine-tuned on the Common Voice dataset, which includes audio data from 56 different languages. The base model used for fine-tuning is facebook/wav2vec2-large-xlsr-53.
Training Procedure
- Preprocessing: Speech input must be sampled at 16 kHz.
- Model Evaluation: The model's performance is measured using Character Error Rate (CER) and Word Error Rate (WER) across various languages.
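As a rough illustration of the two metrics named above, the following sketch computes WER and CER from the Levenshtein edit distance between a reference transcript and a hypothesis. It is not the author's evaluation script; the function names and the example sentences are made up for illustration, and both metrics assume a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat"
hypothesis = "the cat sit"
print(f"WER = {wer(reference, hypothesis):.3f}")  # one wrong word out of three
print(f"CER = {cer(reference, hypothesis):.3f}")  # one wrong character out of eleven
```

CER is usually the more informative of the two for languages without whitespace-delimited words, which is one reason a multilingual model may report both.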
Guide: Running Locally
To run the model locally, follow these steps:
- Environment Setup
    !pip install torchaudio
    !pip install datasets transformers
    !pip install asrp
    !wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
- Model Inference
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_name = "voidful/wav2vec2-xlsr-multilingual-56"
    # Use the GPU when one is available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
    processor = Wav2Vec2Processor.from_pretrained(model_name)

    def load_file_to_data(file, sampling_rate=16_000):
        # Load the audio and resample it to the 16 kHz rate the model expects.
        speech, orig_rate = torchaudio.load(file)
        if orig_rate != sampling_rate:
            speech = torchaudio.functional.resample(speech, orig_rate, sampling_rate)
        return {"speech": speech.squeeze(0).numpy(), "sampling_rate": sampling_rate}

    def predict(data):
        features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
        input_values = features.input_values.to(device)
        with torch.no_grad():
            logits = model(input_values).logits
        # Greedy decoding: pick the most likely token at every frame.
        pred_ids = torch.argmax(logits, dim=-1)
        return processor.decode(pred_ids[0])

    data = load_file_to_data('path/to/audio/file.wav')
    print(predict(data))
- Hardware Recommendations: A GPU is suggested to speed up processing, especially for large datasets or real-time inference. Cloud GPUs such as those available on AWS, Google Cloud, or Azure are one option.
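The inference step in the guide picks the most likely token ID for every frame with torch.argmax and leaves the rest to processor.decode. For a CTC model, that remaining work amounts to merging repeated IDs and dropping the blank symbol. The sketch below shows the idea in plain Python; the toy vocabulary and the blank ID of 0 are assumptions for illustration and do not reflect this model's actual vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs into the final token ID sequence."""
    # Step 1: merge runs of identical consecutive IDs into one ID.
    merged = [token for token, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol, which only separates repeated tokens.
    return [token for token in merged if token != blank_id]


# Hypothetical vocabulary; ID 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "h", 2: "e", 3: "l", 4: "o"}

# One ID per frame, as argmax over the logits would produce.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
token_ids = ctc_greedy_collapse(frame_ids)
print("".join(toy_vocab[i] for i in token_ids))  # hello
```

The blank between the two runs of ID 3 is what lets the double "l" survive the merge step; without it, the repeated frames would collapse into a single letter.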
License
The model is licensed under the Apache-2.0 License, which permits both personal and commercial use provided the license terms are followed.