wav2vec2-large-xlsr-53-english
by jonatasgrosman
Introduction
The wav2vec2-large-xlsr-53-english model is a fine-tuned version of Facebook's Wav2Vec2 model for automatic speech recognition (ASR) in English. It was fine-tuned on the English portion of the Common Voice 6.1 dataset and can be used for a range of English ASR tasks. The model expects input audio sampled at 16kHz.
Architecture
The model is based on the Wav2Vec2 architecture, which learns speech representations through self-supervised pretraining on raw audio. It builds on the XLSR-53 checkpoint, pretrained across 53 languages to produce cross-lingual speech representations, which makes it robust for speech recognition tasks. Inference is supported through libraries such as PyTorch and Transformers.
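When Wav2Vec2 is fine-tuned for ASR, a CTC head predicts one token per audio frame, and the final transcription is obtained by collapsing repeated tokens and dropping the blank token. The sketch below illustrates that greedy decoding step on toy per-frame predictions; the character-level frames and the blank symbol are illustrative assumptions, not the model's actual vocabulary.

```python
# Greedy CTC decoding sketch: take the per-frame token predictions,
# collapse consecutive repeats, then drop the blank token.
# Wav2Vec2 checkpoints typically use the pad token as the CTC blank.
BLANK = "<pad>"

def ctc_greedy_decode(frame_tokens):
    """Collapse per-frame token predictions into a transcription."""
    out = []
    prev = None
    for tok in frame_tokens:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

# Frames: H H <pad> E L <pad> L L O  ->  "HELLO"
frames = ["H", "H", BLANK, "E", "L", BLANK, "L", "L", "O"]
print(ctc_greedy_decode(frames))  # HELLO
```

Note how the blank between the two L tokens is what allows a genuinely doubled letter to survive the repeat-collapsing step.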
Training
Training was conducted on the Common Voice 6.1 dataset, focusing on English language data. The model was fine-tuned using GPU credits provided by OVHcloud. Key metrics for evaluation include Word Error Rate (WER) and Character Error Rate (CER), with performance measured both with and without a language model.
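Word Error Rate is the word-level Levenshtein (edit) distance between the reference and the hypothesis, divided by the number of reference words; CER is the same computation over characters. A minimal sketch of the WER calculation, assuming simple whitespace tokenization (evaluation toolkits such as jiwer implement the same metric with more care):

```python
# Word Error Rate: word-level edit distance divided by the number
# of reference words. Assumes whitespace tokenization.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

Because WER counts insertions as errors, it can exceed 1.0 when the hypothesis is much longer than the reference.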
Guide: Running Locally
To use the model locally, follow these steps:
- Install Dependencies: Ensure you have Python, PyTorch, and Transformers installed.
- Load the Model: Use the Hugging Face Transformers library to load the model and processor.
- Prepare Audio Data: Ensure your audio files are sampled at 16kHz.
- Transcribe Audio: Use the model to transcribe your audio files into text.
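The audio-preparation step can be verified for WAV files with Python's standard-library wave module. The helper below is an illustrative sketch, not part of any library:

```python
import wave

# Check that a WAV file matches the 16kHz sample rate the model expects.
# `check_sample_rate` is a hypothetical helper for illustration.
def check_sample_rate(path: str, expected_hz: int = 16000) -> bool:
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == expected_hz
```

Files at other rates should be resampled to 16kHz (for example with librosa or torchaudio) before transcription.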
Here is a basic example using the huggingsound library:
from huggingsound import SpeechRecognitionModel
# Load the fine-tuned model from the Hugging Face Hub
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
# One transcription result is returned per input file
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
For more intensive tasks, consider using cloud GPUs from providers like AWS or Google Cloud to accelerate processing.
License
The model is released under the Apache 2.0 license, allowing for both commercial and non-commercial use.