wav2vec2-large-xlsr-53-portuguese
Introduction
The wav2vec2-large-xlsr-53-portuguese model by Facebook AI is a pre-trained model for Automatic Speech Recognition (ASR), specifically designed to transcribe Portuguese audio. It leverages the Wav2Vec2 architecture and is fine-tuned on the Portuguese subset of the Common Voice dataset.
Architecture
This model is based on the Wav2Vec2 architecture, which is particularly effective for speech recognition. The underlying XLSR-53 encoder is pretrained with self-supervised learning on unlabeled speech spanning 53 languages, which yields robust representations even when little labeled data is available; a CTC head on top maps those representations to character-level transcriptions.
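To make the flow from waveform to text concrete, here is a minimal sketch of the CTC pipeline using the Transformers API. The one-second silent input is only a stand-in for real Portuguese speech.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The processor bundles the feature extractor (raw waveform -> normalized inputs)
# and the character tokenizer used for decoding.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese")

# One second of silence at 16 kHz stands in for a real utterance.
speech = torch.zeros(16_000).numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, frames, vocab_size)

# Greedy CTC decoding: most likely token per frame, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```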
Training
The model has been fine-tuned on the Portuguese subset of the Common Voice dataset. Evaluation on the test split of this subset yields a Word Error Rate (WER) of 27.1%. The pipeline includes preprocessing steps such as resampling the audio to 16 kHz and normalizing the transcriptions to strip punctuation and other ignored characters.
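The text normalization is typically a small regex pass over the transcriptions. The sketch below illustrates the idea; the exact ignored-character set is an assumption here, not necessarily the list used for this checkpoint.

```python
import re

# Hypothetical punctuation set; the exact chars_to_ignore list used during
# fine-tuning and evaluation is an assumption.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def normalize_text(sentence: str) -> str:
    # Lowercase and strip ignored characters so predictions and references compare cleanly.
    return re.sub(chars_to_ignore_regex, "", sentence).lower().strip()

print(normalize_text("Olá, tudo bem?"))  # -> "olá tudo bem"
```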
Guide: Running Locally
To run this model locally, follow these steps:
1. Set up the environment:
   - Install the PyTorch, Hugging Face Transformers, torchaudio, and datasets libraries.
   - Ensure you have access to a CUDA-enabled GPU for optimal performance.
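For example, a quick sanity check, assuming the libraries have already been installed (e.g. with pip):

```python
import torch
import torchaudio
import transformers
import datasets

# Report library versions and whether a CUDA-capable GPU is visible to PyTorch.
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("torchaudio", torchaudio.__version__)
print("datasets", datasets.__version__)
```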
2. Load the model and processor:

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-xlsr-53-portuguese")
```
3. Prepare the dataset:
   - Download the Portuguese test split of the Common Voice dataset.
   - Resample the audio to 16 kHz and clean the transcription text.
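A sketch of this step is below. It assumes the legacy "common_voice" dataset script is still loadable from the Hub; newer mozilla-foundation/common_voice_* releases may require authentication and a different identifier.

```python
import re
import torchaudio
from datasets import load_dataset

test_dataset = load_dataset("common_voice", "pt", split="test")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'  # hypothetical ignored-character set
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def speech_file_to_array(batch):
    # Common Voice clips are 48 kHz; the model expects 16 kHz mono input.
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower()
    return batch

test_dataset = test_dataset.map(speech_file_to_array)
```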
4. Inference:
   - Use the model to predict transcriptions from the audio data.
   - Compute the WER against the reference transcriptions to evaluate performance.
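A sketch of batched inference and scoring, reusing `model`, `processor`, and `test_dataset` from the previous steps; jiwer is assumed here as the WER implementation (the Hugging Face evaluate/datasets WER metric is an equivalent option).

```python
import torch
from jiwer import wer  # one common way to compute Word Error Rate

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"),
                       attention_mask=inputs.attention_mask.to("cuda")).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))
    return batch

results = test_dataset.map(predict, batched=True, batch_size=8)
print("WER: {:.1%}".format(wer(results["sentence"], results["prediction"])))
```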
For those without local GPU resources, consider using cloud-based GPU services such as AWS, Google Cloud Platform, or Azure.
License
The wav2vec2-large-xlsr-53-portuguese model is licensed under the Apache 2.0 License, which allows both commercial and non-commercial use with minimal restrictions.