unispeech-sat-base-100h-libri-ft
microsoft
Introduction
The UniSpeech-SAT-Base-100H-Libri-FT model, developed by Microsoft, is an automatic speech recognition (ASR) model. It is the UniSpeech-SAT base model fine-tuned on 100 hours of LibriSpeech data. The model was trained on speech sampled at 16 kHz, so input audio should also be sampled at 16 kHz.
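Because the model expects 16 kHz audio, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper name `is_16khz_wav` and the constant `EXPECTED_RATE` are illustrative assumptions and are not part of the model's API. It handles uncompressed WAV files only.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate the model expects, in Hz


def is_16khz_wav(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz.

    Hypothetical helper for illustration; it only reads the WAV header.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == EXPECTED_RATE
```

A file that fails this check should be resampled to 16 kHz before it is passed to the processor.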
Architecture
The model uses the UniSpeech-SAT architecture, which applies self-supervised learning (SSL) to speech and adds speaker-aware pre-training (the "SAT" in the name). Two techniques are used to improve speaker representation learning: multi-task learning, which combines an utterance-wise contrastive loss with the SSL objective to improve speaker discrimination, and utterance mixing, a data augmentation strategy.
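To make the idea of utterance mixing more concrete, here is a rough sketch of one way such an augmentation could look: a random chunk of a second utterance is overlaid on a random region of the main utterance. This is an illustration under stated assumptions, not the authors' implementation. The function name `mix_utterances` and the parameters `max_ratio` and `scale` are hypothetical, and the original method may choose chunk lengths and mixing levels differently.

```python
import numpy as np


def mix_utterances(main, other, max_ratio=0.5, scale=0.5, rng=None):
    """Overlay a random chunk of `other` onto a random region of `main`.

    Illustrative sketch only; `max_ratio` and `scale` are assumed values.
    """
    rng = rng or np.random.default_rng()
    main = np.asarray(main, dtype=np.float32)
    other = np.asarray(other, dtype=np.float32)

    # Cap the chunk length so the main speaker stays dominant.
    max_len = min(len(other), len(main), int(len(main) * max_ratio))
    if max_len < 1:
        return main.copy()

    # Pick the chunk length, then where to read from and where to write to.
    chunk_len = int(rng.integers(1, max_len + 1))
    src_start = int(rng.integers(0, len(other) - chunk_len + 1))
    dst_start = int(rng.integers(0, len(main) - chunk_len + 1))

    # Add the scaled chunk to a copy so the input waveform is not modified.
    mixed = main.copy()
    chunk = other[src_start:src_start + chunk_len]
    mixed[dst_start:dst_start + chunk_len] += scale * chunk
    return mixed
```

The output has the same length as the main utterance, so training labels for the main utterance can stay aligned with the augmented audio.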
Training
This checkpoint was fine-tuned on 100 hours of the LibriSpeech dataset. The underlying UniSpeech-SAT pre-training uses multi-task learning and data augmentation through utterance mixing, which help the model extract speaker information without supervision. These methods were integrated into the HuBERT framework, and the authors' experiments showed state-of-the-art performance in universal representation learning.
Guide: Running Locally
To use the model for transcribing audio files:
- Install Libraries: Ensure you have the transformers and datasets libraries installed, along with PyTorch, which the prediction step imports.

    pip install transformers datasets torch
- Load the Model and Processor:

    from transformers import Wav2Vec2Processor, UniSpeechSatForCTC

    processor = Wav2Vec2Processor.from_pretrained("microsoft/unispeech-sat-base-100h-libri-ft")
    model = UniSpeechSatForCTC.from_pretrained("microsoft/unispeech-sat-base-100h-libri-ft")
- Load an Example Dataset:

    from datasets import load_dataset

    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
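- Tokenize and Predict:

    import torch

    # Convert the raw waveform into model inputs (the model expects 16 kHz audio)
    sample = ds[0]["audio"]
    input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        logits = model(input_values).logits

    # Pick the most likely token per frame, then decode the ids into text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)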
- Cloud GPUs: For faster inference, consider running the model on a cloud GPU from a provider such as Google Cloud, AWS, or Azure.
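The Tokenize and Predict step takes the most likely token for each audio frame and then decodes the resulting ids into text. The sketch below shows, in simplified form, what greedy CTC decoding does conceptually: consecutive repeated predictions are collapsed, blank tokens are dropped, and a word delimiter is turned into a space. The vocabulary here is a toy example invented for illustration, not the model's real vocabulary, and `greedy_ctc_decode` is a hypothetical helper; the processor's own `batch_decode` handles additional details.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real model ships its own.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "C", 3: "A", 4: "T", 5: "S"}
BLANK_ID = 0          # assumed id of the CTC blank token
WORD_DELIMITER = "|"  # token that marks a word boundary


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join tokens."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only separate genuine repeats.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


print(greedy_ctc_decode([2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 5, 0, 3, 4]))  # prints "CAT SAT"
```

Note that a blank between two identical ids keeps both of them, which is how CTC represents doubled letters.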
License
The model is licensed under the Apache 2.0 License. See the official license text for the full terms.