Wav2Vec2-Large-XLSR-Korean
kresnik/wav2vec2-large-xlsr-korean
Introduction
The Wav2Vec2-Large-XLSR-Korean model is an automatic speech recognition (ASR) model for the Korean language. It is based on the Wav2Vec2 architecture and reports a Word Error Rate (WER) of 4.74% and a Character Error Rate (CER) of 1.78% on the Zeroth Korean dataset.
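To make the two metrics concrete, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names and the example sentences are illustrative and not taken from the model card. Toolkits differ on whether whitespace counts as a character; this sketch assumes that it does.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters.

    Assumption: spaces are counted as characters.
    """
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentences: one syllable differs in the last word.
reference = "나는 학교에 간다"
hypothesis = "나는 학교에 갔다"
print("WER:", word_error_rate(reference, hypothesis))
print("CER:", char_error_rate(reference, hypothesis))
```

One wrong syllable makes the whole word wrong for WER but only one character wrong for CER, which is why the CER is usually the lower of the two.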
Architecture
This model leverages the Wav2Vec2 architecture, which is a transformer-based model designed for speech recognition tasks. It processes audio inputs to generate text transcriptions, making it suitable for converting spoken Korean into written form. The model is built using PyTorch and is compatible with the Hugging Face Transformers library.
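The guide in this document loads the model as Wav2Vec2ForCTC and decodes with an argmax over the logits, which corresponds to greedy CTC decoding. The sketch below illustrates the collapsing rule on plain token IDs: merge consecutive repeats, then drop the blank token. The vocabulary, the blank ID of 0, and the function name are assumptions made for illustration; the real tokenizer has its own vocabulary and also handles word delimiters.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame token IDs into text using the CTC rule."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous = token_id
    return "".join(decoded)


# Hypothetical three-entry vocabulary; ID 0 is assumed to be the blank.
vocab = {0: "<pad>", 1: "안", 2: "녕"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frames, vocab))
```

A blank between two identical IDs keeps them as separate characters, which is how CTC represents doubled symbols.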
Training
The model was trained on the Zeroth Korean dataset, using clean speech data. Training consisted of fine-tuning a pretrained Wav2Vec2 model on this dataset so that it transcribes Korean speech accurately.
Guide: Running Locally
To run the Wav2Vec2-Large-XLSR-Korean model locally, follow these steps:
- Set Up Environment
  Install the required Python libraries: transformers, datasets, soundfile, and torch. The evaluation step also imports jiwer.
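- Load the Model and Processor
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Use the GPU when one is available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = Wav2Vec2Processor.from_pretrained("kresnik/wav2vec2-large-xlsr-korean")
    model = Wav2Vec2ForCTC.from_pretrained("kresnik/wav2vec2-large-xlsr-korean").to(device)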
- Load Dataset
    from datasets import load_dataset

    # Load the "clean" configuration and keep only the test split.
    ds = load_dataset("kresnik/zeroth_korean", "clean")
    test_ds = ds["test"]
- Process Audio Data
    import soundfile as sf

    def map_to_array(batch):
        # Read the waveform from disk; the returned sample rate is not used here.
        speech, _ = sf.read(batch["file"])
        batch["speech"] = speech
        return batch

    test_ds = test_ds.map(map_to_array)
- Generate Transcriptions
    import torch

    def map_to_pred(batch):
        # Pad every clip in the batch to the length of the longest one.
        inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")
        # Move the inputs to whichever device the model is on.
        input_values = inputs.input_values.to(model.device)

        with torch.no_grad():
            logits = model(input_values).logits

        # Greedy decoding: take the most likely token at each frame.
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)
        batch["transcription"] = transcription
        return batch

    result = test_ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=["speech"])
- Evaluate Performance
    from jiwer import wer

    # jiwer expects the reference transcripts first, then the model's hypotheses.
    print("WER:", wer(result["text"], result["transcription"]))
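The transcription step calls the processor with padding="longest" because clips in a batch rarely have the same length. The sketch below shows the idea in plain Python, under the assumption that shorter waveforms are right-padded with zeros up to the longest clip in the batch. The helper name pad_to_longest is hypothetical, and the processor's actual implementation may differ in its details.

```python
def pad_to_longest(waveforms, pad_value=0.0):
    """Right-pad each waveform so all share the length of the longest one.

    Returns the padded waveforms and the original lengths, which would let a
    caller tell real samples from padding.
    """
    longest = max(len(w) for w in waveforms)
    lengths = [len(w) for w in waveforms]
    padded = [list(w) + [pad_value] * (longest - len(w)) for w in waveforms]
    return padded, lengths


# Two toy "waveforms" of different lengths.
batch = [[0.1, 0.2, 0.3], [0.5]]
padded, lengths = pad_to_longest(batch)
print(padded)
print(lengths)
```

Padding to the longest clip in each batch, rather than to a fixed maximum, keeps the wasted computation proportional to how much the clip lengths vary within that batch.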
Cloud GPUs
Inference over large datasets is considerably faster on a GPU. If no local GPU is available, cloud GPU services such as Google Colab, AWS EC2 GPU instances, or Azure GPU VMs can be used.
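Before starting a long run on a cloud instance, it can help to confirm that a GPU is actually visible to PyTorch. The following is a minimal sketch; pick_device is a hypothetical helper name, and falling back to "cpu" when PyTorch is not installed is an assumption made so the check never fails outright.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, report CPU rather than raising.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Running on:", pick_device())
```

If this prints "cpu" on an instance that should have a GPU, the usual suspects are a missing driver or a CPU-only PyTorch build.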
License
The Wav2Vec2-Large-XLSR-Korean model is released under the Apache 2.0 License, which permits use, distribution, and modification with appropriate attribution.