Wav2Vec2-Large-XLSR-Korean

by kresnik

Introduction

Wav2Vec2-Large-XLSR-Korean is a model for automatic speech recognition (ASR) in Korean. It fine-tunes the Wav2Vec2 XLSR architecture on Korean speech data and reports a Word Error Rate (WER) of 4.74% and a Character Error Rate (CER) of 1.78% on the Zeroth Korean test set.

Architecture

This model is built on Wav2Vec2, a transformer-based architecture for speech recognition that operates directly on raw audio. The XLSR ("cross-lingual speech representations") variant is pretrained on speech from many languages, which makes it a strong starting point for fine-tuning on an individual language such as Korean. A CTC (connectionist temporal classification) head maps the transformer's frame-level features to characters, converting spoken Korean into written form. The model is implemented in PyTorch and is compatible with the Hugging Face Transformers library.
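
As a rough illustration of this interface (a minimal sketch; one second of silence stands in for real speech, so the decoded string will be empty or meaningless):

    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")

    # 16 kHz waveform in, frame-level character logits out.
    inputs = processor(np.zeros(16000, dtype=np.float32), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # shape: (batch, frames, vocab_size)

    # Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
    print(processor.batch_decode(torch.argmax(logits, dim=-1)))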

Training

The model was trained on the Zeroth Korean dataset, focusing on clean speech data to enhance its accuracy for Korean ASR tasks. The training process involved fine-tuning the Wav2Vec2 architecture on this specific dataset, ensuring that the model could accurately transcribe Korean speech with minimal errors.
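
The exact training recipe and hyperparameters are not published in this card. The sketch below shows a typical CTC fine-tuning setup with the Hugging Face Trainer; the base checkpoint choice and every hyperparameter are placeholder assumptions, and a padding data collator plus the prepared dataset would still need to be supplied.

    import torch
    from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

    # Assumption: start from the multilingual XLSR checkpoint and reuse the
    # Korean character vocabulary shipped with the released processor.
    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=len(processor.tokenizer),
        ctc_loss_reduction="mean",
        pad_token_id=processor.tokenizer.pad_token_id,
    )
    model.freeze_feature_encoder()  # the convolutional front end is commonly kept frozen

    # Placeholder hyperparameters, not the author's published values.
    training_args = TrainingArguments(
        output_dir="wav2vec2-large-xlsr-korean",
        per_device_train_batch_size=8,
        learning_rate=3e-4,
        num_train_epochs=30,
        fp16=torch.cuda.is_available(),
    )
    # Trainer(model=model, args=training_args, ...) would then consume the
    # prepared Zeroth Korean training split via a CTC padding data collator.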

Guide: Running Locally

To run the Wav2Vec2-Large-XLSR-Korean model locally, follow these steps:

  1. Set Up Environment

    • Install the required Python libraries: transformers, datasets, soundfile, torch, and jiwer (used for evaluation in step 6). For example:
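
    pip install transformers datasets soundfile torch jiwer
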
  2. Load the Model and Processor

    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # The processor bundles the feature extractor and the character tokenizer.
    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
    model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean").to(device)
    
  3. Load Dataset

    from datasets import load_dataset
    ds = load_dataset("kresnik/zeroth_korean", "clean")
    test_ds = ds['test']
    
  4. Process Audio Data

    import soundfile as sf

    # Read each utterance from disk; Zeroth Korean audio is 16 kHz,
    # matching the sampling rate the model expects.
    def map_to_array(batch):
        speech, _ = sf.read(batch["file"])
        batch["speech"] = speech
        return batch

    test_ds = test_ds.map(map_to_array)
    
  5. Generate Transcriptions

    import torch

    def map_to_pred(batch):
        # Pad each batch of raw waveforms to the longest example.
        inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
        input_values = inputs.input_values.to(device)

        # Frame-level character logits; no gradients are needed for inference.
        with torch.no_grad():
            logits = model(input_values).logits

        # Greedy CTC decoding: argmax per frame, then collapse repeats
        # and blanks inside batch_decode.
        predicted_ids = torch.argmax(logits, dim=-1)
        batch["transcription"] = processor.batch_decode(predicted_ids)
        return batch

    result = test_ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=["speech"])
    
  6. Evaluate Performance

    from jiwer import cer, wer

    # Compare reference transcripts ("text") against model predictions.
    print("WER:", wer(result["text"], result["transcription"]))
    print("CER:", cer(result["text"], result["transcription"]))
    

Cloud GPUs

For optimal performance, especially when dealing with large datasets, it is recommended to use cloud GPU services such as Google Colab, AWS EC2 with GPU instances, or Azure's GPU VMs.

License

The Wav2Vec2-Large-XLSR-Korean model is released under the Apache 2.0 License, which permits use, distribution, and modification with appropriate attribution.
