unispeech-sat-base-100h-libri-ft

microsoft

Introduction

The UniSpeech-SAT-Base-100H-Libri-FT model, developed by Microsoft, is an automatic speech recognition (ASR) model. It is based on the UniSpeech-SAT architecture and was fine-tuned on 100 hours of LibriSpeech data. The model expects speech audio sampled at 16 kHz, so audio recorded at another rate should be resampled before it is passed in.
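Because the model expects 16 kHz input, audio from other sources usually needs a conversion step first. The following is a minimal sketch of such a step, not part of the model's own tooling: the helper name `to_16k_mono` and the assumed stereo layout `(num_samples, num_channels)` are illustrative choices, and linear interpolation is only a crude stand-in for a proper resampler with anti-aliasing filtering.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate the model expects


def to_16k_mono(samples, source_rate):
    """Return float32 mono audio at 16 kHz (hypothetical helper).

    Uses linear interpolation, which applies no anti-aliasing filter.
    A dedicated audio library is preferable for production use.
    """
    audio = np.asarray(samples, dtype=np.float32)
    if audio.ndim == 2:
        # Assumed layout: (num_samples, num_channels); average the channels.
        audio = audio.mean(axis=1)
    if source_rate == TARGET_RATE:
        return audio
    # Keep the duration constant while changing the number of samples.
    target_len = int(round(audio.shape[0] * TARGET_RATE / source_rate))
    source_times = np.arange(audio.shape[0]) / source_rate
    target_times = np.arange(target_len) / TARGET_RATE
    return np.interp(target_times, source_times, audio).astype(np.float32)
```

The returned array can be passed to the processor in place of a dataset sample's audio array.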

Architecture

The model uses the UniSpeech-SAT (Universal Speech representation learning with Speaker-Aware pre-Training) architecture, which builds on self-supervised learning (SSL) for speech processing. It combines multi-task learning with an utterance mixing strategy to improve speaker representation learning. In the multi-task setup, an utterance-wise contrastive loss is integrated with the SSL objective to improve speaker discrimination.

Training

The UniSpeech-SAT model was fine-tuned on 100 hours of the LibriSpeech dataset. Its pre-training relies on multi-task learning and on data augmentation through utterance mixing, in which additional overlapped utterances are created without supervision and used during training so that the model learns to extract speaker information. These methods were integrated into the HuBERT framework, and the authors report state-of-the-art performance in universal representation learning.
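As a rough illustration of the utterance mixing idea, the sketch below overlays a randomly chosen chunk of a second utterance onto a random region of the primary one. It is a simplified approximation under stated assumptions, not the authors' training code: the function name `mix_utterance`, the 50% cap on the overlapped portion, and the fixed `gain` are all illustrative choices.

```python
import numpy as np


def mix_utterance(primary, secondary, rng, max_overlap_ratio=0.5, gain=0.5):
    """Overlay a random chunk of `secondary` onto `primary` (hypothetical helper).

    max_overlap_ratio and gain are assumed values chosen for illustration.
    """
    primary = np.asarray(primary, dtype=np.float32)
    secondary = np.asarray(secondary, dtype=np.float32)
    # The overlapped chunk may cover at most a fraction of the primary utterance.
    max_len = min(int(len(primary) * max_overlap_ratio), len(secondary))
    if max_len < 1:
        return primary.copy()
    chunk_len = int(rng.integers(1, max_len + 1))
    src_start = int(rng.integers(0, len(secondary) - chunk_len + 1))
    dst_start = int(rng.integers(0, len(primary) - chunk_len + 1))
    mixed = primary.copy()  # leave the original utterance untouched
    chunk = secondary[src_start:src_start + chunk_len]
    mixed[dst_start:dst_start + chunk_len] += gain * chunk
    return mixed


rng = np.random.default_rng(0)
main_utt = np.zeros(16_000, dtype=np.float32)  # placeholder for 1 s of speech
other_utt = np.ones(8_000, dtype=np.float32)   # placeholder for a second speaker
augmented = mix_utterance(main_utt, other_utt, rng)
```

The augmented waveform keeps the length of the primary utterance, so it can stand in for the original sample during training.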

Guide: Running Locally

To use the model for transcribing audio files:

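  1. Install Libraries: Ensure you have the transformers, datasets, and torch libraries installed (torch is imported in step 4). Depending on your datasets version, decoding the audio column may also require an audio backend such as soundfile.

    pip install transformers datasets torch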
    
  2. Load the Model and Processor:

    from transformers import Wav2Vec2Processor, UniSpeechSatForCTC
    processor = Wav2Vec2Processor.from_pretrained("microsoft/unispeech-sat-base-100h-libri-ft")
    model = UniSpeechSatForCTC.from_pretrained("microsoft/unispeech-sat-base-100h-libri-ft")
    
  3. Load an Example Dataset:

    from datasets import load_dataset
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    
  4. Tokenize and Predict:

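    import torch
    sample = ds[0]["audio"]
    # Pass the sampling rate so the processor can check it against the 16 kHz the model expects
    inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest")
    with torch.no_grad():  # inference only, so gradients are not needed
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per audio frame
    transcription = processor.batch_decode(predicted_ids)  # list with one string per input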
    
  5. Cloud GPUs: For faster inference on larger volumes of audio, consider running the model on a GPU, for example a cloud instance from Google Cloud, AWS, or Azure.
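Step 4 takes the most likely token for each audio frame and then lets the processor turn those frame-level ids into text. For a CTC model, that decoding collapses consecutive repeats and drops the blank token. The sketch below shows the idea in plain Python; the function name, the toy vocabulary, and `blank_id=0` are assumptions for illustration (in practice the blank id comes from the tokenizer's padding token), and this is not the library's actual implementation.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    # Merge runs of identical ids: [0, 1, 1, 0, 2] -> [0, 1, 0, 2]
    collapsed = [token for token, _ in groupby(frame_ids)]
    # Drop the blank symbol that separates repeated characters.
    return [token for token in collapsed if token != blank_id]


# Hypothetical toy vocabulary; the real one ships with the model.
vocab = {0: "<pad>", 1: "C", 2: "A", 3: "T"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame arg-max ids
tokens = ctc_greedy_collapse(frames)
text = "".join(vocab[t] for t in tokens)  # → "CAT"
```

A blank between two identical ids keeps them distinct, which is how CTC represents doubled letters: `[1, 0, 1]` decodes to two separate tokens rather than one.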

License

The model is licensed under the Apache 2.0 License.