XLSR Wav2Vec2 Chinese (zh-CN) by Jonatas Grosman

Introduction

The XLSR Wav2Vec2 Chinese model is a version of Facebook's wav2vec2-large-xlsr-53 fine-tuned specifically for Chinese speech recognition. It was trained on the Common Voice 6.1, CSS10, and ST-CMDS datasets and is designed to transcribe audio input sampled at 16 kHz.

Architecture

The model builds upon the wav2vec2-large-xlsr-53 architecture, which is part of the Wav2Vec2 series. This architecture is known for its efficacy in automatic speech recognition (ASR) tasks. It integrates a convolutional feature encoder and a transformer-based context network to capture audio features effectively.

Training

Training was conducted using the Common Voice 6.1, CSS10, and ST-CMDS datasets, with GPU credits provided by OVHcloud. The model was evaluated using Word Error Rate (WER) and Character Error Rate (CER), achieving a WER of 82.37% and a CER of 19.03%. The WER appears very high because Chinese text is written without spaces between words, which inflates word-level error counts; CER is the more meaningful metric for this model.
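As a reference for how the CER metric above is defined, here is a minimal sketch: character-level edit distance divided by reference length. The strings are hypothetical stand-ins for real reference/hypothesis transcripts; in practice a library such as jiwer would typically be used instead.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(cer("你好世界", "你号世界"))  # one substitution over four characters -> 0.25
```

Because Chinese has no word boundaries, the same computation applied at the "word" level (treating whole unsegmented sentences as words) yields the inflated WER figures reported above.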

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have torch, librosa, datasets, and transformers libraries installed.
  2. Load Model and Processor:
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
    processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
    
  3. Prepare Audio Data: Load your audio files using librosa and ensure they are sampled at 16kHz.
  4. Transcribe Audio:
    import torch

    # audio_data: list of float arrays sampled at 16 kHz (see step 3)
    inputs = processor(audio_data, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcriptions = processor.batch_decode(predicted_ids)
    
  5. Evaluate Model: Use the wer.py and cer.py scripts for evaluation, ensuring the test datasets receive the same text preprocessing used during training.
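To clarify what `processor.batch_decode` does with the argmax ids in step 4, here is a sketch of greedy CTC decoding: collapse runs of repeated ids, drop the blank token, and map the remaining ids to characters. The tiny vocabulary below is hypothetical, for illustration only; the real model uses its own tokenizer vocabulary.

```python
BLANK_ID = 0
VOCAB = {1: "你", 2: "好", 3: "吗"}  # hypothetical id -> character table

def greedy_ctc_decode(ids):
    """Collapse repeated ids, skip blanks, and map ids to characters."""
    chars = []
    prev = None
    for i in ids:
        if i != prev and i != BLANK_ID:
            chars.append(VOCAB[i])
        prev = i
    return "".join(chars)

# Frame-level argmax output, e.g. from torch.argmax(logits, dim=-1):
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print(greedy_ctc_decode(frames))  # -> 你好吗
```

The repeated-id collapse is why CTC models can emit one character across several audio frames without duplicating it in the transcript.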

Suggested Cloud GPUs

Consider using cloud services such as AWS, Google Cloud, or OVHcloud for GPU support to enhance processing speed and manage large datasets efficiently.

License

The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use with proper attribution.
