wav2vec2 large xlsr cantonese

ctl

Introduction

wav2vec2-large-xlsr-cantonese is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for automatic speech recognition (ASR) in Cantonese. It was fine-tuned on the Common Voice zh-HK dataset and expects input speech sampled at 16 kHz.

Architecture

The model is built on the Wav2Vec 2.0 architecture, which takes a raw audio waveform as input and predicts the corresponding text. This checkpoint targets Cantonese and was fine-tuned on the zh-HK subset of Common Voice.
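
The guide below loads the checkpoint with Wav2Vec2ForCTC, so the model emits one token prediction per audio frame and the text is recovered by CTC decoding. The following is a minimal sketch of greedy CTC decoding, not the library's implementation. The toy vocabulary, the frame sequence, and the choice of ID 0 as the blank token are assumptions made for illustration.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated predictions, then drop blank tokens."""
    # groupby merges runs of identical IDs into a single entry
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)

# Hypothetical toy vocabulary: ID 0 is assumed to be the CTC blank
vocab = {0: "<pad>", 1: "你", 2: "好"}

# Hypothetical per-frame argmax output for a short utterance
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank token is what lets the decoder tell a genuinely repeated character (separated by a blank) from one character held across several frames.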

Training

The model was fine-tuned on the training and validation subsets of the Common Voice Cantonese (zh-HK) dataset. It achieves a test Character Error Rate (CER) of approximately 15.36%.
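
For readers unfamiliar with the metric, CER is the character-level edit distance between predictions and references, divided by the number of reference characters. The sketch below is a plain-Python illustration under that definition. The function names and sample strings are hypothetical, and the library metric used in the guide may differ in details such as text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(references, predictions):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars

# Hypothetical example: one of three reference characters is missing
score = character_error_rate(["你好嗎"], ["你好"])
print("CER: {:.2f}%".format(100 * score))
```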

Guide: Running Locally

  1. Installation: Ensure Python is installed, and set up the required libraries:
    pip install torch torchaudio transformers datasets
    
  2. Load Dataset: Use the Common Voice zh-HK test subset:
    from datasets import load_dataset
    test_dataset = load_dataset("common_voice", "zh-HK", split="test[:2%]")
    
  3. Initialize Processor and Model:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    processor = Wav2Vec2Processor.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")
    model = Wav2Vec2ForCTC.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")
    
  4. Preprocess Audio: Load each audio file and resample it from 48 kHz to the 16 kHz the model expects:
    import torchaudio
    # Assumes the source clips are sampled at 48 kHz
    resampler = torchaudio.transforms.Resample(48_000, 16_000)
    def speech_file_to_array_fn(batch):
        # Read the waveform from disk and store it as a 1-D NumPy array
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch
    test_dataset = test_dataset.map(speech_file_to_array_fn)
    
  5. Model Inference: Perform inference on the processed data:
    import torch
    inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    # Pick the most likely token per frame, then decode to text
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))
    
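  6. Evaluation: Compute the CER over the test subset, reusing the processor and model loaded earlier:
    from datasets import load_metric
    cer = load_metric("cer")
    # Assumed definition of evaluate (not given in the source); it mirrors the inference in step 5
    def evaluate(batch):
        inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
        batch["pred_strings"] = processor.batch_decode(torch.argmax(logits, dim=-1))
        return batch
    result = test_dataset.map(evaluate, batched=True, batch_size=16)
    print("CER: {:.2f}".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))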
    

Cloud GPUs: For faster processing, consider using cloud-based GPU services like AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.

License

The model is available under the Apache 2.0 license, allowing for both personal and commercial use, modification, and distribution.
