wav2vec2-large-xlsr-53-chinese-zh-cn
XLSR Wav2Vec2 Chinese (zh-CN) by Jonatas Grosman
Introduction
The XLSR Wav2Vec2 Chinese model is a version of Facebook's wav2vec2-large-xlsr-53 fine-tuned specifically for Chinese speech recognition. The model was fine-tuned on the Common Voice, CSS10, and ST-CMDS datasets and is designed to transcribe audio sampled at 16 kHz.
Architecture
The model builds upon the wav2vec2-large-xlsr-53 architecture, a cross-lingual member of the Wav2Vec2 family pretrained on speech from 53 languages. The architecture is well suited to automatic speech recognition (ASR): a convolutional feature encoder extracts latent features from raw audio, and a transformer-based context network captures contextual dependencies across them.
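To see those two components concretely, the checkpoint's configuration can be inspected through the `transformers` API. This is a minimal sketch; the fields shown (`conv_dim`, `num_hidden_layers`, `hidden_size`) are standard `Wav2Vec2Config` attributes rather than anything specific to this model card:

```python
# Minimal sketch: inspect the feature-encoder / transformer split
# described above via standard Wav2Vec2Config attributes.
from transformers import Wav2Vec2Config

config = Wav2Vec2Config.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
print("Convolutional feature encoder layers:", len(config.conv_dim))
print("Transformer context network layers:", config.num_hidden_layers)
print("Hidden size:", config.hidden_size)
```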
Training
Training was conducted on the Common Voice 6.1, CSS10, and ST-CMDS datasets, with GPU credits provided by OVHcloud. The model was evaluated using the Word Error Rate (WER) and Character Error Rate (CER) metrics, achieving a WER of 82.37% and a CER of 19.03%. For Chinese, CER is the more informative of the two: written Chinese has no whitespace word boundaries, so segmentation mismatches inflate WER.
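For reference, both metrics can be computed in a few lines with the Hugging Face `evaluate` library. This is a sketch with invented example strings, not the model card's own `wer.py`/`cer.py` pipeline, which applies additional text preprocessing:

```python
# Sketch of WER/CER computation with the `evaluate` library.
# The prediction/reference strings are invented for illustration.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions = ["今天 天气 很好"]  # hypothetical model output (space-segmented)
references = ["今天 天气 真好"]   # hypothetical ground-truth transcript

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```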
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure the `torch`, `librosa`, `datasets`, and `transformers` libraries are installed.
- Load Model and Processor:

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
```

- Prepare Audio Data: Load your audio files using `librosa` and ensure they are sampled at 16 kHz.
- Transcribe Audio (a complete end-to-end sketch appears after this list):

```python
import torch

inputs = processor(audio_data, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcriptions = processor.batch_decode(predicted_ids)
```

- Evaluate Model: Use the `wer.py` and `cer.py` scripts for evaluation, ensuring proper preprocessing of the test datasets.
Suggested Cloud GPUs
Consider using cloud services such as AWS, Google Cloud, or OVHcloud for GPU support to enhance processing speed and manage large datasets efficiently.
License
The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use with proper attribution.