Wav2Vec2-Base-960H

Introduction

Wav2Vec2-Base-960H is a speech recognition model developed by Facebook AI that transcribes audio into text. It is pretrained and fine-tuned on 960 hours of LibriSpeech data. The model operates on speech audio sampled at 16 kHz and achieves low Word Error Rates (WER) even when fine-tuned with limited labeled data.

Architecture

The Wav2Vec2 model learns representations from raw audio via a contrastive task defined over a quantization of the latent representations: it masks spans of the speech input in the latent space, which lets it learn powerful representations from unlabeled speech. On LibriSpeech it achieves a WER of 1.8/3.3 on the clean/other test sets with the full labeled data, and its performance degrades only modestly as the amount of labeled data is reduced.
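To make the masking step concrete, here is a simplified sketch of span masking over latent frames (the function and its defaults are illustrative only; the actual pretraining scheme described in the wav2vec 2.0 paper is more involved):

```python
import torch

def sample_span_mask(seq_len: int, mask_prob: float = 0.065, span: int = 10) -> torch.Tensor:
    """Pick random start frames and mask `span` consecutive latent frames
    from each start -- a simplified version of wav2vec 2.0's span masking."""
    num_starts = max(1, int(seq_len * mask_prob))
    starts = torch.randperm(seq_len - span)[:num_starts]
    mask = torch.zeros(seq_len, dtype=torch.bool)
    for s in starts:
        mask[s : s + span] = True
    return mask

# Mask a 100-frame latent sequence; during pretraining, masked positions are
# replaced and the model is trained to identify the true quantized latent
# among distractors (the contrastive task).
mask = sample_span_mask(100)
```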

Training

Wav2Vec2-Base-960H was pretrained on a large amount of unlabeled data and fine-tuned on labeled data from the LibriSpeech dataset. This approach allows the model to perform well even with minimal labeled data, demonstrating its effectiveness in semi-supervised learning scenarios.

Guide: Running Locally

To run the Wav2Vec2-Base-960H model locally:

  1. Install Dependencies: Ensure you have transformers, datasets, and torch installed in your Python environment.
  2. Load Model and Processor:
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    
  3. Load Dataset: Use the datasets library to load audio data.
    from datasets import load_dataset
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    
  4. Transcription: Process and transcribe audio using the loaded model (torch must be imported for the argmax step, and the sampling rate should be passed explicitly).
    import torch

    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    

For enhanced performance, consider using a cloud GPU from providers like AWS or Google Cloud Platform to handle the computational demands.
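Moving computation to a GPU follows the usual PyTorch pattern. The sketch below uses a small stand-in module, since any nn.Module (including Wav2Vec2ForCTC) moves the same way:

```python
import torch
import torch.nn as nn

# Fall back to CPU so the snippet also runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for Wav2Vec2ForCTC: weights are moved once with .to(device).
model = nn.Linear(10, 5).to(device)

# Each input batch must be moved to the same device before the forward pass.
batch = torch.randn(2, 10).to(device)

with torch.no_grad():
    out = model(batch)
```

For the real model, the same two .to(device) calls apply to the Wav2Vec2ForCTC instance and to input_values.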

License

The Wav2Vec2-Base-960H model is licensed under the Apache-2.0 License, which permits use, modification, and distribution, subject to conditions such as retaining the license and copyright notices.
