wav2vec2-large-xlsr-cantonese
Introduction
The wav2vec2-large-xlsr-cantonese model (published as ctl/wav2vec2-large-xlsr-cantonese) is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for automatic speech recognition in Cantonese. It was fine-tuned on the Common Voice dataset and expects input speech sampled at 16 kHz.
Architecture
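The model is built on Wav2Vec 2.0, a speech architecture that encodes the raw audio waveform with a convolutional feature extractor followed by a Transformer. For recognition, a Connectionist Temporal Classification (CTC) output layer, exposed in transformers as Wav2Vec2ForCTC, scores every vocabulary token for each audio frame, and these frame-level predictions are decoded into text. This checkpoint targets Cantonese and uses the zh-HK dataset from Common Voice.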
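To make the step from frame-level predictions to text concrete, here is a minimal sketch of greedy CTC decoding in plain Python. The three-token vocabulary, the blank index of 0, and the function name greedy_ctc_decode are illustrative assumptions, not taken from the model; the real tokenizer performs this step inside processor.batch_decode.

```python
from itertools import groupby
from typing import Sequence


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: Sequence[str],
                      blank_id: int = 0) -> str:
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring token id for every audio frame.
    best_ids = [max(range(len(scores)), key=lambda i: scores[i])
                for scores in frame_scores]
    # 2. Collapse runs of the same id into a single id.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Remove the blank token, which only separates repeated characters.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary: index 0 is the CTC blank.
toy_vocab = ["<pad>", "你", "好"]
toy_scores = [
    [0.1, 0.8, 0.1],    # 你
    [0.2, 0.7, 0.1],    # 你 (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # 好
    [0.0, 0.2, 0.8],    # 好 (repeat)
]
print(greedy_ctc_decode(toy_scores, toy_vocab))
```

The blank token is what lets CTC emit the same character twice in a row: two identical ids separated by a blank survive the merge step, while two adjacent identical ids collapse into one.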
Training
The model was fine-tuned on the Common Voice Cantonese dataset, using both the training and validation subsets. It achieves a test Character Error Rate (CER) of approximately 15.36%.
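For readers unfamiliar with the metric, CER is the number of character edits (substitutions, insertions, deletions) needed to turn the prediction into the reference, divided by the number of reference characters. The sketch below is a plain-Python illustration that pools edits over all sentence pairs; the function names and example sentences are made up for illustration, and details such as text normalization may differ from the metric used for the reported figure.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         predictions: Sequence[str]) -> float:
    """Total character edits divided by total reference characters."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, p)
                      for r, p in zip(references, predictions))
    return total_edits / total_chars


# One substituted character out of five reference characters.
print(character_error_rate(["我哋去食飯"], ["我地去食飯"]))
```

Because Cantonese is written without spaces between words, a character-level metric like this is a more natural fit than word error rate.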
Guide: Running Locally
- Installation: Ensure Python is installed, then install the required libraries:
    pip install torch torchaudio transformers datasets
- Load Dataset: Load the first 2% of the Common Voice zh-HK test subset:
    from datasets import load_dataset

    test_dataset = load_dataset("common_voice", "zh-HK", split="test[:2%]")
- Initialize Processor and Model:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")
    model = Wav2Vec2ForCTC.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")
- Preprocess Audio: Load each audio file and resample it to the 16 kHz rate the model expects:
    import torchaudio

    # The source clips are assumed to be sampled at 48 kHz
    resampler = torchaudio.transforms.Resample(48_000, 16_000)

    def speech_file_to_array_fn(batch):
        # Read the clip from disk, resample it, and store it as a 1-D NumPy array
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch

    test_dataset = test_dataset.map(speech_file_to_array_fn)
- Model Inference: Run the model on the first two preprocessed clips:
    import torch

    inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

    # Gradients are not needed for inference
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

    # Greedy decoding: take the most likely token for each frame
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))
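- Evaluation: Compute the CER over the test data. The evaluate function passed to map is user-supplied: it must run the model on each batch and store the decoded text in a pred_strings column. The cer metric depends on the jiwer package.
    from datasets import load_metric

    cer = load_metric("cer")

    # evaluate must add a "pred_strings" column holding the decoded predictions
    result = test_dataset.map(evaluate, batched=True, batch_size=16)
    print("CER: {:.2f}".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))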
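The Evaluation step passes an evaluate function to map without showing its definition. The following is a minimal sketch of one way to build it, assuming the function only needs to fill a pred_strings column from the speech column. The helper names make_evaluate, transcribe, and fake_transcribe are hypothetical, and the stand-in transcriber exists only so the sketch runs without downloading the model.

```python
from typing import Callable, Dict, List, Sequence


def make_evaluate(transcribe: Callable[[Sequence], List[str]]
                  ) -> Callable[[Dict[str, list]], Dict[str, list]]:
    """Wrap a transcription callable into a batched function for Dataset.map."""

    def evaluate(batch: Dict[str, list]) -> Dict[str, list]:
        # batch["speech"] holds the resampled waveforms from the preprocessing step.
        batch["pred_strings"] = transcribe(batch["speech"])
        return batch

    return evaluate


# With the real model, transcribe could reuse the inference code from the guide:
#
#   def transcribe(speech):
#       inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
#       with torch.no_grad():
#           logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
#       return processor.batch_decode(torch.argmax(logits, dim=-1))
#
#   evaluate = make_evaluate(transcribe)


# Stand-in transcriber (hypothetical) that reports the clip length instead of text.
def fake_transcribe(speech: Sequence) -> List[str]:
    return ["<{} samples>".format(len(clip)) for clip in speech]


demo_batch = {"speech": [[0.0, 0.1], [0.2, 0.3, 0.4]], "sentence": ["甲", "乙"]}
demo_result = make_evaluate(fake_transcribe)(demo_batch)
print(demo_result["pred_strings"])
```

Keeping the model call behind a plain callable means the same wrapper works on CPU or GPU and can be exercised with a stub before running the full evaluation.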
Cloud GPUs: For faster processing, consider using cloud-based GPU services like AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
License
The model is available under the Apache 2.0 license, allowing for both personal and commercial use, modification, and distribution.