wav2vec2-large-xlsr-53-portuguese

facebook

Introduction

wav2vec2-large-xlsr-53-portuguese is a Facebook AI model for Automatic Speech Recognition (ASR) of Portuguese audio. It builds on the Wav2Vec2 architecture, starting from the multilingual XLSR-53 checkpoint, and is fine-tuned on the Portuguese portion of the Common Voice dataset.

Architecture

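This model is based on the Wav2Vec2 architecture. Wav2Vec2 first learns speech representations from unlabeled audio through self-supervised pre-training and is then fine-tuned on transcribed speech, which lets it perform well even when labeled data is limited. The fine-tuned model is loaded through the Wav2Vec2ForCTC class: it takes a raw waveform as input and emits a character prediction for each audio frame, and these predictions are decoded into the final transcription.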
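
To make the last step concrete, here is a minimal sketch of greedy CTC decoding, the simplest way to turn per-frame predictions into text. The function name `ctc_greedy_decode`, the toy vocabulary, and the choice of `0` as the blank id are assumptions made for illustration; in practice the processor's own decode methods handle this with the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse runs of identical frame predictions, then drop blank tokens."""
    # Step 1: merge consecutive duplicates, e.g. [1, 1, 0, 2] -> [1, 0, 2]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token and map the remaining ids to characters
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical, not the model's real one); id 0 is the blank.
vocab = {0: "<pad>", 1: "o", 2: "l", 3: "á"}

# One predicted id per audio frame
frames = [1, 1, 0, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames, vocab))  # olá
```

The blank token is what allows genuinely repeated letters to survive: `[1, 0, 1]` decodes to "oo", whereas `[1, 1]` collapses to a single "o".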

Training

The model was fine-tuned on the Portuguese subset of the Common Voice dataset. On that dataset's test split it reaches a Word Error Rate (WER) of 27.1%. Preprocessing consists of resampling the audio to the 16 kHz sampling rate that Wav2Vec2 expects and normalizing the transcripts by stripping certain characters so that they do not count toward the error rate.
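
As a rough illustration of the text normalization and the metric, the sketch below strips a set of punctuation characters and computes WER as the word-level edit distance divided by the number of reference words, WER = (S + D + I) / N. The exact characters ignored for this model are not listed here, so the `CHARS_TO_IGNORE` pattern and the lowercasing step are assumptions, and the helper names are hypothetical.

```python
import re

# Assumed punctuation set; the characters actually ignored may differ.
CHARS_TO_IGNORE = re.compile(r"[,?.!\-;:\"“”]")


def normalize(text: str) -> str:
    """Strip ignored characters, lowercase (an assumption), collapse spaces."""
    return " ".join(CHARS_TO_IGNORE.sub("", text).lower().split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = normalize("O gato preto dorme.")
hypothesis = normalize("o gato branco dorme")
print(word_error_rate(reference, hypothesis))  # 0.25
```

One substituted word out of four reference words gives a WER of 0.25. A corpus-level figure such as the one reported above is normally computed over all test utterances together rather than by averaging per-sentence scores.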

Guide: Running Locally

To run this model locally, follow these steps:

  1. Set up the environment:

    • Install the PyTorch, Hugging Face Transformers, torchaudio, and datasets libraries.
    • Ensure you have access to a CUDA-enabled GPU for optimal performance.
  2. Load the model and processor:

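    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
    # Use the GPU when one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese").to(device)
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese")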
    
  3. Prepare the dataset:

    • Download and preprocess the Common Voice dataset for Portuguese.
    • Resample the audio to 16 kHz and strip the ignored characters from the reference transcripts.
  4. Inference:

    • Use the model to predict transcriptions from the audio data.
    • Compute the WER to evaluate performance.

For those without local GPU resources, consider using cloud-based GPU services such as AWS, Google Cloud Platform, or Azure.

License

The wav2vec2-large-xlsr-53-portuguese model is licensed under the Apache 2.0 License, which permits both commercial and non-commercial use, modification, and redistribution, provided that the license text and attribution notices are retained.
