wav2vec2-large-xlsr-53-english

by jonatasgrosman

Introduction

The wav2vec2-large-xlsr-53-english model is a fine-tuned version of Facebook's Wav2Vec2 model for automatic speech recognition (ASR) in English. It was fine-tuned on the English portion of the Common Voice 6.1 dataset and can be used for a variety of ASR tasks. The model expects input audio sampled at 16 kHz.

Architecture

The model is based on the Wav2Vec2 architecture, which learns speech representations through self-supervised pretraining on raw audio. This checkpoint builds on XLSR-53 (cross-lingual speech representations), a variant pretrained on data from 53 languages, which makes its representations robust across speakers, accents, and recording conditions. Inference is supported through PyTorch and the Hugging Face Transformers library.
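For ASR, Wav2Vec2 models are typically fine-tuned with a CTC (Connectionist Temporal Classification) head: the network emits one token prediction per audio frame, and decoding collapses consecutive repeats and removes the blank token. A minimal sketch of greedy CTC decoding (the function name, token ids, and character mapping below are illustrative, not part of the model's API):

```python
def ctc_greedy_decode(ids, blank_id=0, id_to_char=None):
    """Collapse repeated frame predictions and drop blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for i in ids:
        # Keep a token only when it differs from the previous frame and is not blank.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    if id_to_char is not None:
        return "".join(id_to_char[i] for i in out)
    return out
```

Note that a blank between two identical tokens preserves both: the frame sequence [1, 0, 1] decodes to [1, 1], which is how CTC represents doubled characters.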

Training

Training was conducted on the English portion of the Common Voice 6.1 dataset. Fine-tuning was made possible by GPU credits provided by OVHcloud. The key evaluation metrics are Word Error Rate (WER) and Character Error Rate (CER), reported both with and without an external language model used during decoding.
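WER is the word-level edit distance (insertions, deletions, substitutions) between the model's hypothesis and the reference transcript, divided by the number of reference words; CER is the same computation at the character level. A minimal pure-Python sketch (the `wer` helper here is illustrative, not part of the model's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, against the reference "the cat sat on the mat", the hypothesis "the cat sit on mat" contains one substitution and one deletion, giving a WER of 2/6.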

Guide: Running Locally

To use the model locally, follow these steps:

  1. Install Dependencies: Ensure you have Python, PyTorch, and Transformers installed.
  2. Load the Model: Use the Hugging Face Transformers library to load the model and its processor.
  3. Prepare Audio Data: Ensure your audio files are sampled at 16 kHz; resample them if necessary.
  4. Transcribe Audio: Run the model on the audio to produce text transcriptions.
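The steps above can be sketched directly with the Transformers API. This is a minimal illustration, not the model card's official snippet: the `transcribe` helper is hypothetical, model weights are downloaded on first use, and `librosa` is assumed to be available for loading and resampling audio.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file to text (illustrative helper)."""
    import librosa  # assumed installed; handles decoding and resampling

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    # Load the file and resample to the 16 kHz rate the model expects.
    speech, _ = librosa.load(audio_path, sr=16_000)
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding: take the most likely token per frame, then decode.
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```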

Here is a basic example using the huggingsound library:

from huggingsound import SpeechRecognitionModel

# Model weights are downloaded from the Hugging Face Hub on first use.
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

# Returns one result per input file; the decoded text is under the "transcription" key.
transcriptions = model.transcribe(audio_paths)

For more intensive tasks, consider using cloud GPUs from providers like AWS or Google Cloud to accelerate processing.

License

The model is released under the Apache 2.0 license, allowing for both commercial and non-commercial use.
