wav2vec2-large-robust-ft-libri-960h

facebook

Introduction

The wav2vec2-large-robust-ft-libri-960h model is an automatic speech recognition (ASR) model fine-tuned by Facebook on the Librispeech dataset. It is designed to transcribe audio inputs into text and is based on the Wav2Vec2 architecture. Input audio should be sampled at 16 kHz to match the audio the model was trained on.

Architecture

This model is a fine-tuned version of Wav2Vec2, initially pre-trained on diverse datasets, including Libri-Light, CommonVoice, Switchboard, and Fisher. It has been refined using 960 hours of Librispeech data. Wav2Vec2 uses self-supervised learning to learn speech representations directly from raw audio.

Training

The model was pre-trained on a variety of audio datasets to enhance its robustness across different audio domains. The pre-training involved unlabeled audio data, and fine-tuning was performed on labeled data from the Librispeech dataset. This approach allows the model to generalize well across various domains and improve its performance on the target domain.

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Install the transformers and datasets libraries via pip:

    pip install transformers datasets soundfile torch
    
  2. Load Model and Processor:

    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-libri-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-libri-960h")
    
  3. Prepare Data: Load your audio files and convert them to arrays using a library like soundfile.

  4. Tokenize and Infer: Convert audio data into tensors and pass them through the model to obtain transcriptions.

    import torch
    # Tell the processor the sampling rate of the raw waveform (16 kHz expected).
    input_values = processor(your_audio_data, sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    # Disable gradient tracking for inference.
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    
  5. Use Cloud GPUs: For faster inference, especially when processing large datasets, consider cloud-based GPUs from providers such as AWS, Google Cloud, or Azure.
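In step 4, the `argmax` plus `batch_decode` calls amount to CTC greedy decoding: consecutive duplicate predictions are collapsed and the blank token is dropped. A minimal sketch of that rule with a toy vocabulary (the real processor's vocabulary and blank id differ):

```python
def ctc_greedy_decode(ids, blank_id=0, vocab=None):
    """Collapse consecutive duplicates, then drop blanks (the standard CTC rule)."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    if vocab is not None:
        return "".join(vocab[i] for i in out)
    return out

# Toy vocabulary: id 0 is the CTC blank, the rest are characters.
vocab = {1: "H", 2: "E", 3: "L", 4: "O"}
ids = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
print(ctc_greedy_decode(ids, blank_id=0, vocab=vocab))  # "HELLO"
```

Note how the blank between the two "L" runs prevents them from being collapsed into one letter.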

License

This model is licensed under the Apache-2.0 License, allowing for wide use and distribution in both private and commercial applications.
