wav2vec2 large xlsr 53 spanish

facebook

Introduction

The wav2vec2-large-xlsr-53-spanish model is an automatic speech recognition (ASR) model from Facebook AI for transcribing Spanish audio. It uses the Wav2Vec 2.0 architecture and was fine-tuned on the Spanish portion of the Common Voice dataset.

Architecture

The model uses the Wav2Vec 2.0 architecture, which takes raw audio waveforms as input and predicts transcriptions. Because its speech representations are learned from unlabeled audio during pretraining, fine-tuning requires comparatively little labeled data. This version starts from the XLSR (Cross-Lingual Speech Representations) checkpoint, which was pretrained on speech from 53 languages, and is fine-tuned for Spanish.
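The `Wav2Vec2ForCTC` class used later in the guide indicates a CTC output head: the model emits one token prediction per audio frame, and decoding collapses those frames into text. The sketch below illustrates greedy CTC decoding on a toy example. The vocabulary, the blank index, and the function name are hypothetical and chosen only for illustration; the real model's vocabulary and decoding are handled by its processor.

```python
from itertools import groupby
from typing import List, Sequence


def greedy_ctc_decode(frame_ids: Sequence[int], vocab: List[str], blank_id: int = 0) -> str:
    """Collapse per-frame token IDs into a string using the greedy CTC rule."""
    # 1. Merge runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)


# Toy vocabulary (an assumption, not the model's real one); index 0 is the blank.
toy_vocab = ["_", "h", "o", "l", "a"]
print(greedy_ctc_decode([1, 1, 0, 2, 2, 0, 3, 3, 0, 0, 4], toy_vocab))
```

A blank between two identical IDs keeps them as separate characters, which is how doubled letters survive the collapse step.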

Training

The model was fine-tuned on the Spanish subset of the Common Voice dataset. Audio was resampled to 16 kHz, the input rate the model expects, and transcripts were normalized by removing punctuation and converting the text to lowercase. Performance was evaluated with the Word Error Rate (WER) metric, with a reported result of 17.6%.
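To make the text normalization and the WER metric concrete, here is a minimal pure-Python sketch. The punctuation set and the function names are assumptions, since the exact characters removed during training are not listed here, and the reported score was presumably computed with an evaluation library rather than hand-written code like this.

```python
import re

# Assumed punctuation set; the characters actually stripped may differ.
PUNCTUATION = re.compile(r'[¿?¡!.,;:"%\-]')


def normalize(text: str) -> str:
    """Remove punctuation, lowercase, and collapse whitespace."""
    return " ".join(PUNCTUATION.sub("", text).lower().split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

This scores a single utterance. A corpus-level WER is usually obtained by summing the edit counts and the reference word counts over all utterances before dividing, not by averaging per-utterance scores.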

Guide: Running Locally

To run the model locally, follow these steps:

  1. Environment Setup: Install the necessary libraries:

    pip install torch torchaudio transformers datasets
    
  2. Load the Model and Processor:

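    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    # Use the GPU when one is available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53-spanish").to(device)
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53-spanish")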
    
  3. Dataset Preparation:

    from datasets import load_dataset
    ds = load_dataset("common_voice", "es", split="test")
    
  4. Audio Resampling:

    import torchaudio
    # Downsample 48 kHz clips to the 16 kHz rate used as model input.
    resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
    
  5. Processing and Prediction: Pass the resampled audio through the processor and the model, decode the predicted token IDs into text, and compute the Word Error Rate (WER) against the reference transcriptions.
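The steps above can be combined into one inference routine, sketched below under a few assumptions: the input is a list of local audio file paths, clips are downmixed to mono, and decoding is a plain argmax over the logits. The helper names `chunked` and `transcribe` and the batch size are illustrative, not part of the model's API, and the exact behaviour of `torchaudio.load` depends on the installed torchaudio version and audio backend.

```python
from typing import Iterator, List, Sequence

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"
TARGET_RATE = 16_000  # sampling rate the guide resamples to


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transcribe(paths: Sequence[str], batch_size: int = 8) -> List[str]:
    """Transcribe audio files with greedy (argmax) decoding."""
    # Heavy dependencies are imported here so that importing this module
    # does not require them or trigger a model download.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device).eval()

    texts: List[str] = []
    for batch in chunked(paths, batch_size):
        waveforms = []
        for path in batch:
            waveform, rate = torchaudio.load(path)  # shape: (channels, frames)
            waveform = torchaudio.functional.resample(waveform, rate, TARGET_RATE)
            waveforms.append(waveform.mean(dim=0).numpy())  # downmix to mono
        inputs = processor(waveforms, sampling_rate=TARGET_RATE,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values.to(device),
                           attention_mask=inputs.attention_mask.to(device)).logits
        # Most likely token per frame; the processor collapses them into text.
        texts.extend(processor.batch_decode(torch.argmax(logits, dim=-1)))
    return texts
```

Calling `transcribe(["clip1.mp3", "clip2.mp3"])` (hypothetical file names) would return one transcription per file. Comparing those strings with the normalized reference sentences gives the WER.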

Inference runs considerably faster on a GPU. If none is available locally, a cloud GPU instance from a provider such as AWS, Google Cloud, or Azure is a practical alternative.

License

The model is released under the Apache 2.0 License, which permits free use, modification, and distribution, provided that redistributed copies include a copy of the license and retain the required notices.
