Wav2Vec2-Large-XLSR-Kazakh
Introduction
The WAV2VEC2-LARGE-XLSR-53-KAZAKH model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Automatic Speech Recognition (ASR) in the Kazakh language. It was trained using the Kazakh Speech Corpus v1.1.
Architecture
The model is based on the Wav2Vec2 architecture, specifically the facebook/wav2vec2-large-xlsr-53 variant, which is designed for multilingual and cross-lingual speech recognition tasks. The model processes audio inputs sampled at 16 kHz.
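Because the model only accepts 16 kHz input, audio recorded at other rates must be resampled first. In practice a library resampler such as `torchaudio.transforms.Resample` is the right tool; the sketch below is a hypothetical pure-Python linear-interpolation version, included only to illustrate what the rate conversion does.

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a sequence of float samples from src_rate to dst_rate
    using linear interpolation (illustration only -- use a proper
    resampler such as torchaudio.transforms.Resample in practice)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i in the source signal.
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# A 1-second 48 kHz signal becomes a 1-second 16 kHz signal.
signal_48k = [0.0] * 48000
signal_16k = resample_linear(signal_48k, src_rate=48000)
print(len(signal_16k))  # 16000
```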
Training
The model was trained using the Kazakh Speech Corpus v1.1. The fine-tuning process involved adapting the pre-trained Wav2Vec2 model to recognize and transcribe Kazakh speech, achieving a Word Error Rate (WER) of 19.65%.
Guide: Running Locally
- Setup Environment: Ensure you have Python, PyTorch, Torchaudio, and Hugging Face Transformers installed.
- Load Model and Processor:

  ```python
  from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

  processor = Wav2Vec2Processor.from_pretrained("wav2vec2-large-xlsr-kazakh")
  model = Wav2Vec2ForCTC.from_pretrained("wav2vec2-large-xlsr-kazakh")
  ```
- Prepare Data:
  - Use the Kazakh Speech Corpus v1.1.
  - Ensure audio files are resampled to 16 kHz.
- Run Inference:

  ```python
  import torch

  inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt", padding=True)
  with torch.no_grad():
      logits = model(inputs.input_values).logits
  predicted_ids = torch.argmax(logits, dim=-1)
  transcription = processor.batch_decode(predicted_ids)
  ```
- Evaluation: Use the `wer` metric from Hugging Face's `datasets` library to evaluate the model on a test set.
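The reported 19.65% WER means roughly one word in five is substituted, inserted, or deleted relative to the reference transcript. The hosted `wer` metric computes word-level edit distance divided by the number of reference words; as a stand-alone illustration (not the library implementation), the metric can be sketched in plain Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length.
    Illustration only; assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three reference words -> WER of 1/3.
print(wer("бұл бір сынақ", "бұл бір сынак"))
```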
For faster processing, consider using cloud GPUs such as those available on AWS, GCP, or Azure.
License
The model is licensed under the Apache 2.0 License.