Wav2Vec2-Large-XLSR-Korean

by kresnik

Introduction

Wav2Vec2-Large-XLSR-Korean is a model for automatic speech recognition (ASR) in Korean. It fine-tunes the Wav2Vec2 XLSR architecture on Korean speech data and reports a Word Error Rate (WER) of 4.74% and a Character Error Rate (CER) of 1.78% on the Zeroth Korean test set.

Architecture

This model is built on Wav2Vec2, a transformer-based architecture for speech recognition that operates directly on raw audio. The XLSR ("cross-lingual speech representations") variant is pretrained on speech from many languages, which makes it a strong starting point for fine-tuning on an individual language such as Korean. A CTC (connectionist temporal classification) head maps the transformer's frame-level features to characters, converting spoken Korean into written form. The model is implemented in PyTorch and is compatible with the Hugging Face Transformers library.
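
As a rough illustration of this interface (a minimal sketch; one second of silence stands in for real speech, so the decoded string will be empty or meaningless):

    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")

    # 16 kHz waveform in, frame-level character logits out.
    inputs = processor(np.zeros(16000, dtype=np.float32), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # shape: (batch, frames, vocab_size)

    # Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
    print(processor.batch_decode(torch.argmax(logits, dim=-1)))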

Training

The model was trained on the Zeroth Korean dataset, focusing on clean speech data to enhance its accuracy for Korean ASR tasks. The training process involved fine-tuning the Wav2Vec2 architecture on this specific dataset, ensuring that the model could accurately transcribe Korean speech with minimal errors.
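
The exact training recipe and hyperparameters are not published in this card. The sketch below shows a typical CTC fine-tuning setup with the Hugging Face Trainer; the base checkpoint choice and every hyperparameter are placeholder assumptions, and a padding data collator plus the prepared dataset would still need to be supplied.

    import torch
    from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

    # Assumption: start from the multilingual XLSR checkpoint and reuse the
    # Korean character vocabulary shipped with the released processor.
    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=len(processor.tokenizer),
        ctc_loss_reduction="mean",
        pad_token_id=processor.tokenizer.pad_token_id,
    )
    model.freeze_feature_encoder()  # the convolutional front end is commonly kept frozen

    # Placeholder hyperparameters, not the author's published values.
    training_args = TrainingArguments(
        output_dir="wav2vec2-large-xlsr-korean",
        per_device_train_batch_size=8,
        learning_rate=3e-4,
        num_train_epochs=30,
        fp16=torch.cuda.is_available(),
    )
    # Trainer(model=model, args=training_args, ...) would then consume the
    # prepared Zeroth Korean training split via a CTC padding data collator.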

Guide: Running Locally

To run the Wav2Vec2-Large-XLSR-Korean model locally, follow these steps:

  1. Set Up Environment

    • Install the required Python libraries: transformers, datasets, soundfile, torch, and jiwer (used for evaluation in step 6). For example:
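
    pip install transformers datasets soundfile torch jiwer
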
  2. Load the Model and Processor

    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # The processor bundles the feature extractor and the character tokenizer.
    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
    model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean").to(device)
    
  3. Load Dataset

    from datasets import load_dataset
    ds = load_dataset("kresnik/zeroth_korean", "clean")
    test_ds = ds['test']
    
  4. Process Audio Data

    import soundfile as sf

    # Read each utterance from disk; Zeroth Korean audio is 16 kHz,
    # matching the sampling rate the model expects.
    def map_to_array(batch):
        speech, _ = sf.read(batch["file"])
        batch["speech"] = speech
        return batch

    test_ds = test_ds.map(map_to_array)
    
  5. Generate Transcriptions

    import torch

    def map_to_pred(batch):
        # Pad each batch of raw waveforms to the longest example.
        inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
        input_values = inputs.input_values.to(device)

        # Frame-level character logits; no gradients are needed for inference.
        with torch.no_grad():
            logits = model(input_values).logits

        # Greedy CTC decoding: argmax per frame, then collapse repeats
        # and blanks inside batch_decode.
        predicted_ids = torch.argmax(logits, dim=-1)
        batch["transcription"] = processor.batch_decode(predicted_ids)
        return batch

    result = test_ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=["speech"])
    
  6. Evaluate Performance

    from jiwer import cer, wer

    # Compare reference transcripts ("text") against model predictions.
    print("WER:", wer(result["text"], result["transcription"]))
    print("CER:", cer(result["text"], result["transcription"]))
    

Cloud GPUs

For optimal performance, especially when dealing with large datasets, it is recommended to use cloud GPU services such as Google Colab, AWS EC2 with GPU instances, or Azure's GPU VMs.

License

The Wav2Vec2-Large-XLSR-Korean model is released under the Apache 2.0 License, which permits use, distribution, and modification with appropriate attribution.
